Non-cryptographic hashes

 

About once a year I need to hash some data to give it a name / short checksum. I don’t need a cryptographic hash. But I end up using SHA-1 or MD5 anyway because it’s so available. A non-crypto hash should be ~100x faster to process data, which could actually matter. But finding a standard, portable, non-cryptographic hash with optimized implementations for Python, Node, etc is a challenge.

A lot of history of hash functions is producing 32 bit values on 32 bit CPUs. That stuff is mostly obsolete. 32 bit hashes are not useful for many applications; you expect hash collisions after just 2^16 ~= 65,000 objects. And 32 bit CPUs are largely irrelevant now. You really want a micro-optimized machine code implementation for AMD64 that is aware of cache geometry, pipeline optimizations, etc. It may be now that you want a GPU implementation; you certainly do for high throughput crypto hashes. Not clear if it matters as much for non-crypto hashes and deployment realities mean you may not have a GPU at all.

Stock Python has the hashlib library, which includes crypto hashes like MD5 and SHA-1. It’s highly optimized. There’s also CRC32 and Adler32 hiding in the zlib library, but they are only 32 bit hashes and therefore not useful.

MurmurHash3 is the usual recommendation for a non-cryptographic hash. It’s been around a long time and widely implemented. It’s also a bit confusing, with both 32 and 128 bit versions and multiple Python implementations.

There’s a bunch of “nextgen” hash functions out there. Metrohash and xxHash both look promising on first glance.

SMHasher is the standard test and benchmark suite for hash performance. I’d love to read a carefully done comparison of hash functions out there in contemporary languages and deployment environments.

5 thoughts on “Non-cryptographic hashes

  1. I can’t believe you’re dissing 32 bit hashes, the hardest working hashes in show business like that! It’s still by far and away the most common use of hashes – hashtables. Most of those 32 bits get thrown away anyway in most cases and the data structures are designed with collisions in mind.

    More seriously, I think the reason you see relatively little work for the sort of purpose you have in mind is that when they retire, old crypto hashes are still excellent general purpose hashes with carefully optimized and tuned high-speed implementations. They’re so fast, for most practical purposes, there’s not much point looking at anything else. On my dinky laptop, openssl spews out 800 Mb/sec of sha1 on 8k sized blocks, per core – about the top speed of a decent PCIe attached SSD (you can try this with openssl speed sha1). Latest Intel CPUs (Skylake) have SHA1/256-assisting instructions and are faster. It’s probably faster than you can typically bring data into a node or python runtime as it is.

  2. Yeah I should clarify I’m mostly interested in content hashes. I want a statistically unique fingerprint for relatively large data. Ie: hash a file’s contents so I can tell if I’ve seen the file already. 32 bit hashes are just fine for smaller hash tables or things where collisions are acceptable. There’s really little reason to resist the common crypto hashes other than the theoretical idea of speed.

  3. Right, and another aspect I forgot to type is that since most systems these days deal with unauthenticated (i.e. potentially malicious) data, the preimage resistance of even old crypto hashes is an attractive, zero-effort security bonus. J Random Internet Person isn’t going to submit a sha1 colliding dataset and break your system. It also makes me wonder where the wide versions of the high-performance non-crypto hashes are used in practice. Are there a zillion cores somewhere in the bowels of Google busily computing CityHash128’s? And on what?

  4. Due to inability to resist a dumb benchmark:

    In theory, metrohash (and the like) should be in the 10-20ish times faster range. In practice, it seems to be about 2x. If you squint at the user/system breakdown, you might be able to convince yourself it’s closer to 3x if it weren’t for the crappy copy-required metrohash binding interface. openssl invoked on the command line gets this done for sha1 as fast as metrohash does it from python. Times are from primed cache and i/o-less. Not shown: openssl md5 is slower than sha1 just about always.

  5. Thanks for doing the quick test! I’m convinced the quality of the machine language implementation is the most important factor here, and that performance is highly dependent on the specifics of the machine architecture. Which to me means “don’t try to be smart, just use the common thing everyone uses”, which speaks to SHA-1 or whatever else SSL tends to use.

Comments are closed.