A “simplest” hash function is completely dependent on what you are using the hash function for and the guarantees you want a hash function to make. An optimal permutation of an integer is different from a small string hash is different from a block checksum. Literally, you are optimizing the algorithm for entirely different properties. No algorithm can satisfy all of them even approximately.
The full scope of things hash functions are commonly used for requires at least four algorithms if you care about performance and optimality. It is disconcertingly common to see developers using hash algorithms in contexts where they are not fit for purpose. Gotta pick the right tool for the job.
I'm perplexed to the claim that addition is cheaper than XOR, especially since addition is built upon XOR, am I missing anything? Is it javascript specific?
The wording was a bit unclear. The previous paragraph mentions wanting something cheaper than "those pesky XORs and multiplications". The multiplication is the expensive part; the (very cheap) XORs are just mildly annoying because you have to think about what they're doing.
At least on x86, multiple additions and multiplications can be done with a single `lea` instruction so it's preferable to XOR. Though I have no idea about other architectures, compiler implementations, any interpreters...
That only helps with multiplications by statically known word sizes (4x, 8x, etc.) and not arbitrary x·y. It can help with many smaller constant multipliers if the complete is clever, but it has to be known at compile time.
It is mathematically impossible for a proper hash function (one with an output range smaller than its input range) to not have collisions. The proof uses the Pigeon Hole Principle https://en.wikipedia.org/wiki/Pigeonhole_principle
A “simplest” hash function is completely dependent on what you are using the hash function for and the guarantees you want a hash function to make. An optimal permutation of an integer is different from a small string hash is different from a block checksum. Literally, you are optimizing the algorithm for entirely different properties. No algorithm can satisfy all of them even approximately.
The full scope of things hash functions are commonly used for requires at least four algorithms if you care about performance and optimality. It is disconcertingly common to see developers using hash algorithms in contexts where they are not fit for purpose. Gotta pick the right tool for the job.
> Like addition
I'm perplexed to the claim that addition is cheaper than XOR, especially since addition is built upon XOR, am I missing anything? Is it javascript specific?
The wording was a bit unclear. The previous paragraph mentions wanting something cheaper than "those pesky XORs and multiplications". The multiplication is the expensive part; the (very cheap) XORs are just mildly annoying because you have to think about what they're doing.
At least on x86, multiple additions and multiplications can be done with a single `lea` instruction so it's preferable to XOR. Though I have no idea about other architectures, compiler implementations, any interpreters...
That only helps with multiplications by statically known word sizes (4x, 8x, etc.) and not arbitrary x·y. It can help with many smaller constant multipliers if the complete is clever, but it has to be known at compile time.
This is a excellent article for anyone looking for some more in-depth analysis of tabulation based hashing methods: https://arxiv.org/abs/1505.01523
a hash function that produce hashes that are already in the hash table should, IMO, not be called a hash function.
I get why technically it is a hash function, but still, no.
It is mathematically impossible for a proper hash function (one with an output range smaller than its input range) to not have collisions. The proof uses the Pigeon Hole Principle https://en.wikipedia.org/wiki/Pigeonhole_principle
ok damn, I did not know this, obviously. Thanks.
A perfect hash function https://en.wikipedia.org/wiki/Perfect_hash_function has to be specially constructed for each desired set of inputs. Generic hash functions cannot be 'perfect'.
Here is a hash function that does not have hash collisions:
what programming language is this?
Well it no longer constrains the data in a fixed output length.
Sure, but if you constrain to fixed output length, you will definitely have collisions (Pigeon Hole Principle). There's no way around that.