About once a year I need to hash some data to give it a name or a short checksum. I don’t need a cryptographic hash, but I end up using SHA-1 or MD5 anyway because they’re so readily available. A non-crypto hash should be far faster at processing data, perhaps ~100x in the best case, which could actually matter. But finding a standard, portable, non-cryptographic hash with optimized implementations for Python, Node, etc. is a challenge.
A lot of the history of hash functions is about producing 32 bit values on 32 bit CPUs. That stuff is mostly obsolete. 32 bit hashes are not useful for many applications; by the birthday bound you should expect hash collisions after only about 2^16 ≈ 65,000 objects. And 32 bit CPUs are largely irrelevant now. What you really want is a micro-optimized machine code implementation for AMD64 that is aware of cache geometry, pipeline optimizations, etc. These days you may even want a GPU implementation; you certainly do for high throughput crypto hashes. It’s not clear that it matters as much for non-crypto hashes, and deployment realities mean you may not have a GPU at all.
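As a rough sketch of the birthday arithmetic behind that 2^16 figure, the snippet below uses the standard approximation p ≈ 1 - exp(-n(n-1)/2N) for n items hashed into N possible values. The function names are made up for illustration, and the formula assumes an ideal, uniformly distributed hash.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_items share a hash value.

    Assumes an ideal hash that spreads values uniformly over 2**hash_bits.
    """
    space = 2 ** hash_bits
    exponent = n_items * (n_items - 1) / (2 * space)
    # 1 - exp(-x), computed with expm1 so tiny probabilities don't round to 0
    return -math.expm1(-exponent)


def items_for_even_odds(hash_bits: int) -> float:
    """Approximate number of items at which a collision becomes a coin flip."""
    return math.sqrt(2 * math.log(2) * 2 ** hash_bits)


if __name__ == "__main__":
    for bits in (32, 64, 128):
        print(bits, collision_probability(65_536, bits), items_for_even_odds(bits))
```

Under these assumptions, 65,536 objects and a 32 bit hash already give roughly a 39% chance of a collision, and the even-odds point is around 77,000 objects. For a 64 bit hash the even-odds point moves out to about 5 billion objects.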
Stock Python has the hashlib library, which includes crypto hashes like MD5 and SHA-1. It’s highly optimized. There’s also CRC32 and Adler32 hiding in the zlib library, but they are only 32 bit checksums and therefore too small to be useful for naming more than a few thousand objects.
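For reference, here is a minimal sketch of what the standard library options look like side by side. It uses only `hashlib` and `zlib` from stock Python; the sample input and variable names are arbitrary.

```python
import hashlib
import zlib

data = b"hello"

# Cryptographic hashes from hashlib: 128 bit (MD5) and 160 bit (SHA-1) digests.
md5_hex = hashlib.md5(data).hexdigest()
sha1_hex = hashlib.sha1(data).hexdigest()

# 32 bit checksums from zlib; both return an unsigned int in Python 3.
crc = zlib.crc32(data)
adler = zlib.adler32(data)

if __name__ == "__main__":
    print("md5    ", md5_hex)
    print("sha1   ", sha1_hex)
    print("crc32  ", f"{crc:08x}")
    print("adler32", f"{adler:08x}")
```

The digests differ a lot in size: 32 hex characters for MD5 and 40 for SHA-1, against 8 for either zlib checksum.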
MurmurHash3 is the usual recommendation for a non-cryptographic hash. It’s been around a long time and is widely implemented. It’s also a bit confusing, with both 32 and 128 bit versions and multiple Python implementations.
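To make the 32 bit variant concrete, here is a pure-Python sketch of MurmurHash3’s 32 bit (x86) function. It is meant as a readable reference, not something to deploy: it will be far slower than an optimized native implementation, and the function name `murmur3_32` is just a label chosen here, not any particular library’s API.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32 bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3, 32 bit x86 variant, returned as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n_blocks = len(data) // 4

    # Body: mix in each complete 4-byte little-endian block.
    for i in range(0, n_blocks * 4, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: mix in the remaining 1 to 3 bytes, if any.
    tail = data[n_blocks * 4:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: fold in the length, then avalanche the bits.
    h ^= len(data) & MASK32
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


if __name__ == "__main__":
    print(f"{murmur3_32(b'test'):08x}")
```

Some of the confusion comes from the fact that the 128 bit versions are separate algorithms rather than wider outputs of this one, so their results are unrelated to the 32 bit value. Implementations can also differ on whether they return a signed or an unsigned integer; this sketch returns unsigned.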
SMHasher is the standard test and benchmark suite for hash function quality and performance. I’d love to read a carefully done comparison of the hash functions out there in contemporary languages and deployment environments.