(Summary of a talk at the Amsterdam python meetup)
Shane is a data hoarder. He hates losing data. He has lots of data. He could move it to the cloud, but he doesn’t trust that: cloud companies go away after a time. And: disks are cheap now, so he doesn’t need to.
But: bad things happen to good data. Disks go corrupt. You can use checksums to detect problems.
In the ideal world, the OS or the hardware detects problems and everything just works. But that isn’t true for Linux RAID (which doesn’t do checks). ext4 doesn’t do checking at all. ZFS could work, but the license is deliberately Linux-GPL-unfriendly.
So: he wants proper checksums. You could use sha1sum in a shell script and pipe the checksums to a file. But the simple case doesn’t handle metadata. And it isn’t parallel.
So… he started a Python program. Python is “batteries included”: it has a great standard library. For instance os.walk(), which goes through all the directories and files. From there you can build usable full file paths. With hashlib.sha224() he could get a hash per file.
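To make the walk-and-hash idea concrete, here is a minimal sketch. It is not Shane's actual code: the function names, the chunked reading and the use of os.path.join() to build the full paths are my own assumptions.

```python
import hashlib
import os


def sha224_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-224 hex digest of one file, read in chunks."""
    digest = hashlib.sha224()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def walk_and_hash(top):
    """Yield (full_path, hex_digest) for every regular file below top."""
    for dirpath, dirnames, filenames in os.walk(top):
        dirnames.sort()  # sorting in place makes the traversal order stable
        for name in sorted(filenames):
            # Assumption: os.path.join() is how the full path gets built.
            full_path = os.path.join(dirpath, name)
            # Skip symlinks and special files; only hash regular files.
            if os.path.isfile(full_path) and not os.path.islink(full_path):
                yield full_path, sha224_of_file(full_path)
```

Something like `for path, digest in walk_and_hash("/some/dir"): print(digest, path)` would then give roughly the same output as the sha1sum shell script, just with a different hash.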
He tried it out with various python versions:
python3 (cpython3) is the base.
python 2: works, but for new programs he wouldn’t bother supporting it. Python 2 is end-of-life in two years’ time.
pypy: normally much faster, but not in this case.
jython: worked, with a few small changes.
ironpython: didn’t work. Missing libraries.
Issue: localization of file names. The problem is that file systems don’t have language settings. Linux ext4 is often UTF-8, but it isn’t mandatory. NTFS and VFAT are often UTF-16.
Python 3’s standard libraries try to be smart. If you ask for files in b'/some/dir', you get byte strings. With '/some/dir', you get strings (unicode), unless there’s something that isn’t properly encodable.
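A small sketch of that bytes-in, bytes-out behaviour, using os.listdir() and os.fsencode() from the standard library. The helper names are made up for this example; whether the program handles file names this way is an assumption on my part.

```python
import os


def list_names(directory):
    """Return the sorted entry names in directory.

    A bytes path gives bytes names, a str path gives str names.
    """
    return sorted(os.listdir(directory))


def to_raw_bytes(name):
    """Convert a file name (str or bytes) to bytes using the
    filesystem encoding, so it can be stored without guessing."""
    return os.fsencode(name)
```

Storing the raw bytes is probably the safest choice for a checksum tool, since it avoids depending on whatever encoding a file name happens to be in.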
There are also problems with timestamps: FAT (still used on USB sticks!) has a two-second resolution. Modern OSes have nanosecond resolution. Python by default uses a float type that doesn’t have enough resolution for that. There are ways around it.
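One of the ways around the float problem is the integer st_mtime_ns field that os.stat() provides next to the float st_mtime. The sketch below is only an illustration: the helper names are mine, and rounding down to the coarser resolution is a simplifying assumption, not necessarily what a FAT driver does.

```python
import os

# Two seconds in nanoseconds, the FAT resolution mentioned in the talk.
FAT_RESOLUTION_NS = 2 * 1_000_000_000


def mtime_ns(path):
    """Return the modification time as an integer number of nanoseconds."""
    return os.stat(path).st_mtime_ns


def same_mtime(ns_a, ns_b, resolution_ns=1):
    """Compare two nanosecond timestamps at a given resolution.

    Assumption: both values are rounded down to a multiple of
    resolution_ns before they are compared.
    """
    return ns_a // resolution_ns == ns_b // resolution_ns
```

With integers, a comparison between a copy on a USB stick and the original on ext4 can be made as coarse or as fine as the file systems involved allow.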
Cryptographic hashing? md5: workable, but old. Would be fine for this purpose, but it would mean a great deal of explaining (because there have been attacks on it).
He uses SHA-224 (SHA-2). Today he would have used BLAKE2.
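Both hashes are in hashlib, so switching is mostly a one-line change. A rough sketch: the function names are hypothetical, and picking a 28-byte BLAKE2b digest so that it matches SHA-224's length is my own choice, not something from the talk.

```python
import hashlib


def sha224_hex(data):
    """SHA-224 digest of a bytes object, as a hex string."""
    return hashlib.sha224(data).hexdigest()


def blake2_hex(data, digest_size=28):
    """BLAKE2b digest of a bytes object, as a hex string.

    Assumption: digest_size=28 is used so the output has the same
    length as SHA-224 (28 bytes, 56 hex characters).
    """
    return hashlib.blake2b(data, digest_size=digest_size).hexdigest()
```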
Now the multiprocessing model of the program. He split it into the major workloads:
Main thread: os.walk, executing the ‘stat’ calls (for metadata). IO bound.
Worker threads (calculate hash values).
Serializer thread (outputs values in the correct order).
Actually, he uses multiprocessing, not threading. So it runs multiple processes instead of threads, which avoids Python’s GIL (“global interpreter lock”).
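A minimal sketch of such a setup with multiprocessing.Pool. This is a simplification and not the structure of the real program: there is no separate serializer here, because Pool.imap() already returns results in input order. The function names are hypothetical.

```python
import hashlib
import multiprocessing
import sys


def hash_file(path):
    """Worker: return (path, SHA-224 hex digest) for one file."""
    digest = hashlib.sha224()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return path, digest.hexdigest()


def hash_files_in_order(paths, processes=None):
    """Hash paths in worker processes; results keep the input order."""
    paths = list(paths)
    if not paths:
        return []
    with multiprocessing.Pool(processes=processes) as pool:
        # imap() gives work to whichever worker is free, but yields the
        # results in the order of the input. That stands in for the
        # serializer step described above.
        return list(pool.imap(hash_file, paths))


if __name__ == "__main__":
    # Usage: python this_script.py FILE [FILE ...]
    for path, hexdigest in hash_files_in_order(sys.argv[1:]):
        print(hexdigest, path)
```

The `if __name__ == "__main__":` guard matters with multiprocessing: on platforms that start workers by re-importing the main module, unguarded code would run again in every worker.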
The program is at https://github.com/shane-kerr/fileinfo
Two years ago he did a fun experiment using Python’s tar file support: https://github.com/shane-kerr/fv . Tar already calculates checksums, and as long as you don’t actually put the file contents into the tarfile, but instead inject a short comment about the file…
My name is Reinout van Rees and I work a lot with Python (programming language) and Django (website framework). I live in The Netherlands and I'm happily married to Annie van Rees-Kooiman.