File checksums in Python: the hard way - Shane Kerr

Tags: python

(Summary of a talk at the Amsterdam python meetup)

Shane is a data hoarder. He hates losing data, and he has lots of it. He could move it to the cloud, but he doesn’t trust that: cloud companies tend to go away after a while. And disks are cheap now, so he doesn’t need to.

But: bad things happen to good data. Disks get corrupted. You can use checksums to detect problems.

In the ideal world, the OS or hardware detects problems and everything just works. But that isn’t the case: Linux software RAID doesn’t do integrity checks, and ext4 doesn’t do checksumming at all. ZFS could work, but its license is deliberately GPL-unfriendly, which keeps it out of the Linux kernel.

So: he wants proper checksums. You could run sha1sum in a shell script and pipe the checksums to a file. But that simple approach doesn’t handle metadata. And it isn’t parallel.

So… he started a Python program. Python is “batteries included”: it has a great standard library. For instance os.walk(), which recurses through all the directories and files. With os.path.join() and os.path.normpath() you get usable full file paths. With hashlib.sha224() he could compute a hash per file.
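A minimal sketch of those three batteries working together (hash_tree is a hypothetical helper name, not taken from Shane's program):

```python
import hashlib
import os

def hash_tree(top):
    """Walk a directory tree and yield (path, sha224-hex) pairs."""
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.normpath(os.path.join(dirpath, name))
            digest = hashlib.sha224()
            # Read in chunks so large files don't have to fit in memory.
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    digest.update(chunk)
            yield path, digest.hexdigest()
```

Piping the output of such a generator to a file gives you roughly what the sha1sum shell script would, but from inside Python where you can also record metadata.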

He tried it out with various python versions:

  • python3 (cpython3) is the base.
  • python 2: works, but for new programs he wouldn’t bother supporting it. Python 2 is end-of-life in two years’ time.
  • pypy: normally much faster, but not in this case.
  • jython: worked, with a few small changes.
  • ironpython: didn’t work. Missing libraries.

Issue: localization of file names. The problem is that file systems don’t have language settings. On Linux, ext4 file names are usually UTF-8, but that isn’t mandatory. NTFS and VFAT store file names as UTF-16.

Python 3’s standard library tries to be smart. If you ask for the files in b'/some/dir', you get byte strings back. With '/some/dir', you get (unicode) strings, unless there’s a name that isn’t properly decodable.
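A small demonstration of that behaviour, using a throwaway temporary directory; os.fsencode() converts a str path to the bytes form the way the interpreter itself would:

```python
import os
import tempfile

d = tempfile.mkdtemp()
with open(os.path.join(d, "hello.txt"), "w") as f:
    f.write("hi")

str_names = os.listdir(d)                 # str argument -> str names
byte_names = os.listdir(os.fsencode(d))   # bytes argument -> bytes names

assert all(isinstance(n, str) for n in str_names)
assert all(isinstance(n, bytes) for n in byte_names)
```

Working with bytes paths throughout sidesteps the "not properly encodable" corner case, at the cost of less readable output.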

There are also problems with timestamps: FAT (still used on USB sticks!) has two-second resolution. Modern OSs have nanosecond resolution. Python’s default float type doesn’t have enough precision for that. There are ways around it.

Which algorithms?

  • Checksums?

  • CRC?

  • Hash function?

  • Cryptographic hashing? md5: workable, but old. It would be fine for this purpose, but it would mean a great deal of explaining (because there have been attacks on it).

    He uses SHA-224 (SHA-2). Today he would have used BLAKE2.
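Both choices are available out of the box in hashlib (BLAKE2 since Python 3.6), and BLAKE2 can be told to produce a digest of the same 224-bit length:

```python
import hashlib

# Both algorithms are guaranteed to be present in the stdlib.
assert "sha224" in hashlib.algorithms_guaranteed
assert "blake2b" in hashlib.algorithms_guaranteed  # Python 3.6+

data = b"some file contents"
sha = hashlib.sha224(data).hexdigest()
# digest_size is in bytes: 28 bytes = 224 bits, matching SHA-224.
b2 = hashlib.blake2b(data, digest_size=28).hexdigest()
assert len(sha) == len(b2) == 56  # 56 hex characters
```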

Now the multiprocessing model of the program. He split it into the major workloads:

  • Main thread: os.walk, executing the ‘stat’ calls (for metadata). IO bound.
  • Worker threads (calculate hash values).
  • Serializer thread (outputs values in the correct order).

Actually, he uses multiprocessing, not threading. So it runs multiple processes instead of threads, which avoids Python’s GIL (“global interpreter lock”).
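A rough sketch of that pipeline shape using multiprocessing.Pool — an illustration of the idea, not Shane's actual code. Here pool.imap() stands in for the serializer step, because unlike imap_unordered() it yields results in input order:

```python
import hashlib
import multiprocessing
import os

def hash_file(path):
    """Worker process: hash one file (the CPU-bound part)."""
    digest = hashlib.sha224()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return path, digest.hexdigest()

def walk_files(top):
    """Main process: walk the tree and yield paths (the I/O-bound part)."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            yield os.path.normpath(os.path.join(dirpath, name))

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        # imap keeps the output in the same order as the input paths.
        for path, digest in pool.imap(hash_file, walk_files(".")):
            print(digest, path)
```

Because the workers are separate processes, several files can be hashed truly in parallel, which threads under the GIL would not allow.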

The program is at https://github.com/shane-kerr/fileinfo

Two years ago he did a fun experiment using Python’s tarfile module: https://github.com/shane-kerr/fv . Tar already calculates checksums, and as long as you don’t actually put the file contents into the tarfile, but instead inject a short comment about the file…


About me

My name is Reinout van Rees and I work a lot with Python (programming language) and Django (website framework). I live in The Netherlands and I'm happily married to Annie van Rees-Kooiman.
