I have spent quite some time over the last couple of months improving the Haskell hashable library, and all of my efforts eventually turned into a near-complete rewrite of the library.
The 1.2 release of hashable is not backwards compatible, and for several good reasons. Read on for the details.
Hash flooding
The threat landscape faced by applications that use hash-based data structures has shifted radically this year, with hash flooding emerging as an attack vector after a long period of obscurity (the earliest publication I know of on the subject dates back to 1998, but I doubt that it was new even then).
We know that both networked and local applications that use weak hash algorithms are vulnerable, with new attacks being published all the time.
Recently, even modern hash algorithms such as the MurmurHash family and CityHash have succumbed to surprising attacks via differential cryptanalysis, in which large numbers of collisions can be generated efficiently without knowledge of the salt.
SipHash
The SipHash algorithm offers a strong defence against differential cryptanalysis. SipHash is now the algorithm used for hashing all string-like types.
We automatically choose optimized implementations for 64-bit and 32-bit platforms, depending on their capabilities (e.g. availability of SSE2 on 32-bit systems).
For inputs of just a few bytes in length, SipHash is about half the speed of the previous string hash algorithm, FNV-1. Its performance breaks even with FNV-1 at about 50 bytes, and it improves to become more than twice as fast for large inputs.
(FNV-1 is also particularly easy to attack, so escaping from it was imperative, even if that cost a little performance.)
Random choice of salt
To provide an additional measure of security against hash flooding attacks, the standard hash
function chooses a random salt at program startup time, using the system’s cryptographic pseudo-random number generator (/dev/urandom
on Unix, CryptGenRandom
on Windows).
Effortless, fast hashing
I have also demoted hash
out of the Hashable
typeclass, so the API now looks like this.
class Hashable a where
hashWithSalt :: Int -- salt
-> a -- value to hash
-> Int -- result
hash :: Hashable a => a -> Int
hash = hashWithSalt defaultSalt
In other words, if you are writing an instance of Hashable
by hand, you had better pay attention to the salt, and if you use the friendly hash
function, you get the randomly-chosen default salt every time.
Now that writing Hashable
instances by hand is slightly more painful, I have also made it almost effortless to hash your custom datatypes.
Using GHC’s no-longer-so-new generics machinery, you can have an efficient Hashable
instance generated “for free” with just a few lines of code. Good code is generated for both product and sum types.
{-# LANGUAGE DeriveGeneric #-}
import GHC.Generics
data MyType = MyStr String
| MyInt Integer
deriving (Generic)
instance Hashable MyType
If your types are polymorphic, no problem; they’re just as easy to generate hash functions for.
data Poly a = Poly a
deriving (Generic)
instance Hashable a => Hashable (Poly a)
Improved avalanche for basic types
We do not currently use SipHash to hash basic types such as numbers, as strings are by a huge margin the most common vector for hash flooding attacks.
Nevertheless, we’ve made improvements to our hashing of basic types. Previous versions of hashable did nothing at all, meaning that the hash of a number was simply the identity function. While (obviously) fast and good for locality of reference in a few narrow cases, the identity function makes a terrible hash for many data structures.
For instance, if a hash table uses open addressing with linear probing, the identity hash can cause quadratic performance when inserting values into the table that are identical modulo the table size.
It’s generally considered desirable for a hash function to have strong “avalanche” properties, meaning that a single-bit change in an input should cause every bit of the output to have a 50% probability of changing.
For numberic types, we now use a couple of algorithms developed by Thomas Wang that are both fast and have good avalanche properties. These functions are more expensive than doing nothing, but quite a lot faster than SipHash. If it turns out that integers become a popular vector for hash flooding, we may well switch to SipHash for everything at some point.
In summary
This release of the hashable library is a fairly big deal. While the performance implications aren’t entirely happy, I’ve done my best to write the fastest possible code.
I hope you’ll agree that the improvements in ease of use, security, and applicability (thanks to no longer using identity hashes) are more than worth the cost. Happy hacking!
Leave a Reply