Fast unicode decoding in Python 2.7
The codecs module in Python provides an elegant wrapper class, codecs.EncodedFile, that “provides transparent encoding translation” on any file-like object you hand to it. Unfortunately, it can also be less than speedy when translating a lot of data.
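For reference, here is a minimal, self-contained sketch (not from the original post) of what wrapping a file-like object with codecs.EncodedFile looks like; the in-memory StringIO buffer and its contents are just placeholders:

# -*- coding: utf-8 -*-
# Minimal sketch (assumed, not part of the original post): wrap an
# in-memory byte stream with codecs.EncodedFile and iterate its lines.
import codecs
from StringIO import StringIO

raw = StringIO("caf\xc3\xa9\nna\xc3\xafve\n")   # UTF-8 encoded bytes
wrapped = codecs.EncodedFile(raw, 'utf-8')      # decode and re-encode as UTF-8

for line in wrapped:
    # each line has round-tripped through the UTF-8 codec; malformed
    # byte sequences would raise UnicodeDecodeError here
    print(line.rstrip())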
Fortunately, Python 2.7 offers a fast alternative: the io module, essentially a backport of the default Python 3.x file/stream library. The Cython Utils package pointed me in the right direction on how to use this module; I’ve included a quick demo script below with timings to show how effective switching can be.
Note: I mention Python 2.7 here because, although 2.6 introduced the io module, it was implemented in pure Python and not rewritten in C until 2.7. In earlier versions it is likely the same speed as the codecs module, or slower.
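The key difference is simply which wrapper the binary stream goes through. A minimal, assumed sketch of the io-based path (the full demo script follows below) could look like this:

# -*- coding: utf-8 -*-
# Minimal sketch (assumed): decode an in-memory byte stream through
# io.TextIOWrapper, the C-accelerated path available in Python 2.7.
import io

raw = io.BytesIO(b"caf\xc3\xa9\nna\xc3\xafve\n")   # UTF-8 encoded bytes
text = io.TextIOWrapper(raw, encoding='utf-8')

for line in text:
    # unlike codecs.EncodedFile, lines come back as unicode objects
    assert isinstance(line, unicode)
    print(line.rstrip().encode('utf-8'))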
# -*- coding: utf-8 -*-
import codecs
import io
import sys
import tarfile


def parse_archive(path, module):
    use_codecs = module == 'codecs'
    print("Using codecs module: %s" % use_codecs)
    archive = tarfile.open(path, "r")
    for tarinfo in archive.getmembers():
        if tarinfo.isreg():
            # extractfile() returns a binary file-like object
            data_file = archive.extractfile(tarinfo)
            if use_codecs:
                data_file = codecs.EncodedFile(data_file, 'utf-8')
            else:
                data_file = io.TextIOWrapper(io.BytesIO(data_file.read()),
                        encoding='utf-8')
            try:
                for line in data_file:
                    # here is where you would do the normal work
                    pass
            except UnicodeDecodeError as e:
                print("Could not decode %s, skipping file" % tarinfo.name)
            data_file.close()
    archive.close()


if __name__ == '__main__':
    parse_archive(sys.argv[1], sys.argv[2])

# vim: set ts=4 sw=4 et:
This script is based on the reporead script used in archweb, the Arch Linux main site. The example compressed tar file used here is 4.7 MB (around 50 MB uncompressed) and contains 10620 files.
$ /usr/bin/time python2 decode_test.py /tmp/updaterepos/i686/extra.files.tar.gz codecs
Using codecs module: True
18.62user 0.01system 0:18.65elapsed 99%CPU (0avgtext+0avgdata 118672maxresident)k
0inputs+0outputs (0major+7557minor)pagefaults 0swaps
$ /usr/bin/time python2 decode_test.py /tmp/updaterepos/i686/extra.files.tar.gz io
Using codecs module: False
2.42user 0.04system 0:02.47elapsed 99%CPU (0avgtext+0avgdata 140240maxresident)k
0inputs+0outputs (0major+15243minor)pagefaults 0swaps
When a simple change like this results in an 8x speedup, I think it is worth the switch. I made a commit to archweb switching from codecs to io if available after finding and squashing this bottleneck.
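The actual archweb commit isn’t reproduced here, but the “prefer io, fall back to codecs” pattern it describes might look something like this sketch (utf8_reader is a hypothetical helper name):

# -*- coding: utf-8 -*-
# Hypothetical sketch of a "prefer io, fall back to codecs" helper; not
# the actual archweb change.
import codecs

try:
    import io
except ImportError:
    io = None


def utf8_reader(fileobj):
    # Wrap a binary file-like object so that iterating it yields decoded
    # lines. Caveat: the io branch yields unicode objects while the codecs
    # branch yields re-encoded byte strings, so callers must cope with both.
    if io is not None:
        return io.TextIOWrapper(io.BytesIO(fileobj.read()), encoding='utf-8')
    return codecs.EncodedFile(fileobj, 'utf-8')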
On a related note, the script above is Python 3 compatible, so I tried running the test there as well. With the io module, the test took 2.6 seconds; with codecs, it never completed, and I gave up after waiting more than 14 minutes. I think something is wrong there and might need to get reported.