Python hash table randomisation

This is an incomplete, abbreviated and under researched history of hash table randomisation in Python. It's intended to be accessible and high level. For technical details please use the references provided. Now, a good place to start is, what is a hash table and why would Python want to randomise it?

A hash table is the data structure used to implement Python dictionaries. A hash table can efficiently look up a value by mapping the key to a specific position. An important property of hash tables is how they handle collision resolution. The performance of a hash table is predicated on there being few collisions between multiple keys mapping to the same position. When there are a large number of collisions, performance can degrade dramatically.

It's this degradation in performance that was the basis of a 2011 presentation on efficient denial of service attacks. Denial of service, or DOS attacks, on hash tables weren't exactly new. But practical attacks on real world implementations were. Naturally, in addition to the security list, it was reported to python-dev and issue13703 was opened.

There are different solutions to this problem. Use a different collision resolution method is one. Provide an alternate data structure, such as a trie, is another. (Although I wouldn't use the one I wrote just yet) The generally accepted solution was to introduce some randomisation into the algorithm that maps keys to positions.

To achieve this randomisation, keys are passed through a hash function before placing or retrieving values. The hash function is seeded with a random value when Python starts. This makes it more difficult for an attacker to predict collisions. Hopefully so difficult as to render DOS attacks on the hash table impractical. Hash randomisation was enabled by default in Python 3.3.

Unfortunately the story doesn't end there. The algorithm used to randomise hash tables didn't quite work. Too few effective seeds ended up being used, making it possible to test for which seed was current. issue14621 was opened to report this and a presentation on revisiting hash-flooding dos attacks was given later that year. While this presentation focused on attacking the MumurHash and CityHash64 hash functions, proof of concept code was shown for attacking Python's new hash table randomisation.

In addition to demonstrating weaknesses in existing hash functions for hash table randomisation, the presenters announced new algorithm called SipHash. SipHash is a cryptographic hash function. Unlike the hash functions above, it should be practically impossible to determine the input from the output of a cryptographic hash function. This mitigates the types of attacks previously demonstrated.

Other issues with the Python implementation also exist. The algorithm performs poorly and the implementation is difficult to replace with an alternatives more appropriate for embedded systems. To discuss these issues and possible solutions, Christian Heimes drafted PEP456. Its currently working towards acceptance.

Early test results from Christian of using SipHash in Python show positive results for both x86 and x64. Hopefully PEP456 will achieve acceptance in time for Python 3.4's release in February next year. Then this story can come to an end.