Fast unicode decoding in Python 2.7

The codecs module in Python provides an elegant wrapper class, codecs.EncodedFile, that “provides transparent encoding translation” on any file-like object you provide to it. Unfortunately, it can also be less than speedy when translating a lot of data.
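For reference, here is roughly how the wrapper gets used on its own ('data.txt' is a hypothetical path to a UTF-8 encoded file):

# a minimal sketch of codecs.EncodedFile by itself
import codecs

raw = open('data.txt', 'rb')
wrapped = codecs.EncodedFile(raw, 'utf-8')
for line in wrapped:
    # with only a data encoding given, each line is decoded and
    # re-encoded as UTF-8; bad input raises UnicodeDecodeError
    pass
wrapped.close()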

Fortunately, in Python 2.7 there is a fast alternative: the io module, essentially a backport of the default Python 3.x files/streams library. The Cython Utils package pointed me in the right direction on how to use this module; I've included a quick demo script below, with timings, to show how effective switching can be.

Note: I say Python 2.7 deliberately. The io module was introduced in 2.6, but there it is written in pure Python; it was not rewritten in C until 2.7. On 2.6, then, it is likely the same speed as or slower than the codecs module.
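If you want to verify which implementation you are getting, one check that I believe works is to look at where the classes are defined:

import io

# on 2.7 the classes come from the C-accelerated _io module;
# on 2.6 they are defined in pure Python in io itself
print(io.TextIOWrapper.__module__)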

# -*- coding: utf-8 -*-
import codecs
import io
import sys
import tarfile

def parse_archive(path, module):
    use_codecs = module == 'codecs'
    print("Using codecs module: %s" % use_codecs)
    archive = tarfile.open(path, "r")
    for tarinfo in archive.getmembers():
        if tarinfo.isreg():
            data_file = archive.extractfile(tarinfo)
            if use_codecs:
                # round-trips each line through UTF-8; invalid input
                # raises UnicodeDecodeError
                data_file = codecs.EncodedFile(data_file, 'utf-8')
            else:
                # tarfile's extractfile() object is not a real io
                # stream, so buffer the member into a BytesIO first
                data_file = io.TextIOWrapper(io.BytesIO(data_file.read()),
                        encoding='utf-8')
            try:
                for line in data_file:
                    # here is where you would do the normal work
                    pass
            except UnicodeDecodeError:
                print("Could not decode %s, skipping file" % tarinfo.name)
            data_file.close()

    archive.close()

if __name__ == '__main__':
    parse_archive(sys.argv[1], sys.argv[2])

# vim: set ts=4 sw=4 et:

This script is based on the reporead script used in archweb, the code behind the Arch Linux main site. The example compressed tar file is 4.7 MB (around 50 MB uncompressed) and contains 10,620 files.

$ /usr/bin/time python2 decode_test.py /tmp/updaterepos/i686/extra.files.tar.gz codecs
Using codecs module: True
18.62user 0.01system 0:18.65elapsed 99%CPU (0avgtext+0avgdata 118672maxresident)k
0inputs+0outputs (0major+7557minor)pagefaults 0swaps

$ /usr/bin/time python2 decode_test.py /tmp/updaterepos/i686/extra.files.tar.gz io
Using codecs module: False
2.42user 0.04system 0:02.47elapsed 99%CPU (0avgtext+0avgdata 140240maxresident)k
0inputs+0outputs (0major+15243minor)pagefaults 0swaps

When a simple change like this results in a nearly 8x speedup, it is well worth the switch. After finding and squashing this bottleneck, I committed a change to archweb that switches from codecs to io when available.
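The change itself boils down to a small fallback shim. Here is a sketch of the idea (not the exact archweb code; the function name is mine):

import codecs
try:
    import io
except ImportError:
    io = None

def open_utf8(fileobj):
    # prefer the C-accelerated io wrapper when available, falling
    # back to codecs.EncodedFile on older Pythons
    if io is not None:
        return io.TextIOWrapper(io.BytesIO(fileobj.read()),
                encoding='utf-8')
    return codecs.EncodedFile(fileobj, 'utf-8')

The two wrappers are not identical (iterating the io version yields unicode strings, while EncodedFile yields re-encoded byte strings), but both are line-iterable and both raise UnicodeDecodeError on bad input, which is all this code needs.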

On a related note, the script above is Python 3 compatible, so I ran the test there as well. With the io module, the test took 2.6 seconds; with codecs, it never completed, and I gave up after waiting more than 14 minutes. Something seems wrong there and probably deserves a bug report.
