Preparing for Mavericks

Apple has released OS X 10.9, also known as Mavericks, which means it's time to prepare for an operating system upgrade again. In the interest of soliciting feedback, I'm sharing the rough process I'm using for this upgrade.

  • Update all software. In addition to installing all pending App Store updates, I make sure any applications that have auto-updaters are started and check for new updates. In particular, this means running Mail to ensure GPGTools updates GPGMail.
  • Back up all important data. Most of my important but not confidential data is already stored remotely on GitHub and Dropbox. I'm in the process of setting up Arq to back up the remainder.
  • Update and upgrade Homebrew. Rather than just upgrade Homebrew, for this upgrade I've decided to completely uninstall and reinstall by following the uninstall instructions on the Homebrew FAQ and the commands in this Gist.
  • Compile a list of post-upgrade procedures. Currently, the only step I've recorded is to use xcode-select to install the latest command line tools.

It's hardly a detailed or rigorous plan. If you do have feedback, or think anything is missing, please leave a comment below. Alternatively, there is also Twitter or email.


Python hash table randomisation

This is an incomplete, abbreviated and under-researched history of hash table randomisation in Python. It's intended to be accessible and high level. For technical details, please use the references provided. A good place to start is: what is a hash table, and why would Python want to randomise it?

A hash table is the data structure used to implement Python dictionaries. A hash table can efficiently look up a value by mapping the key to a specific position. An important property of hash tables is how they handle collision resolution. The performance of a hash table is predicated on there being few collisions between multiple keys mapping to the same position. When there are a large number of collisions, performance can degrade dramatically.
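To make the cost of collisions concrete, here is a minimal sketch of a toy hash table. It is only an illustration: the ChainedTable class, the insert_cost helper and the two hash functions are names I've made up, and it uses separate chaining, which is not how Python's dictionaries resolve collisions. It counts key comparisons while inserting the same keys twice, once with a hash function that spreads keys out and once with one that sends every key to the same position.

```python
class ChainedTable:
    """Toy hash table using separate chaining (illustrative only)."""

    def __init__(self, size=64, hash_fn=hash):
        self.buckets = [[] for _ in range(size)]
        self.hash_fn = hash_fn
        self.comparisons = 0  # key comparisons performed so far

    def _bucket(self, key):
        # Map the key to a position in the table.
        return self.buckets[self.hash_fn(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing, _) in enumerate(bucket):
            self.comparisons += 1
            if existing == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        for existing, value in self._bucket(key):
            self.comparisons += 1
            if existing == key:
                return value
        raise KeyError(key)


def insert_cost(keys, hash_fn):
    """Insert every key and return how many comparisons that took."""
    table = ChainedTable(hash_fn=hash_fn)
    for key in keys:
        table.put(key, None)
    return table.comparisons


keys = list(range(500))
well_spread = insert_cost(keys, hash_fn=lambda k: k)  # keys spread over all buckets
all_collide = insert_cost(keys, hash_fn=lambda k: 0)  # every key lands in bucket 0

print("comparisons with few collisions:", well_spread)
print("comparisons when every key collides:", all_collide)
```

With every key colliding, each insert has to walk past all the keys inserted before it, so the total work grows with the square of the number of keys rather than roughly linearly.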

It's this degradation in performance that was the basis of a 2011 presentation on efficient denial of service attacks. Denial of service (DoS) attacks on hash tables weren't exactly new, but practical attacks on real-world implementations were. Naturally, in addition to the security list, it was reported to python-dev and issue13703 was opened.

There are different solutions to this problem. Using a different collision resolution method is one. Providing an alternate data structure, such as a trie, is another (although I wouldn't use the one I wrote just yet). The generally accepted solution was to introduce some randomisation into the algorithm that maps keys to positions.

To achieve this randomisation, keys are passed through a hash function before placing or retrieving values. The hash function is seeded with a random value when Python starts. This makes it more difficult for an attacker to predict collisions, hopefully so difficult as to render DoS attacks on the hash table impractical. Hash randomisation was enabled by default in Python 3.3.
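The idea of seeding can be sketched in a few lines. This is a toy, not Python's actual algorithm: make_seeded_hash and positions are hypothetical helpers, and the hash is a simple FNV-1a style loop with the seed mixed into its starting state. Mixing a seed into a simple hash like this should not be considered secure; the sketch only shows that the same keys map to positions that depend on a value chosen once per process.

```python
import os

# Standard 64-bit FNV-1a constants.
FNV_OFFSET = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def make_seeded_hash(seed):
    """Return a toy string hash whose output depends on `seed`."""
    def seeded_hash(text):
        h = (FNV_OFFSET ^ seed) & MASK64  # the seed perturbs the starting state
        for byte in text.encode("utf-8"):
            h ^= byte
            h = (h * FNV_PRIME) & MASK64
        return h
    return seeded_hash


def positions(keys, hash_fn, table_size=8):
    """Map each key to a position in a table with `table_size` slots."""
    return [hash_fn(key) % table_size for key in keys]


# Chosen once when the process starts, as an interpreter would at start-up.
process_seed = int.from_bytes(os.urandom(8), "little")
hash_this_run = make_seeded_hash(process_seed)

keys = ["spam", "eggs", "ham"]
print(positions(keys, hash_this_run))  # varies from run to run
```

Within one run the positions are stable, so lookups still work. Between runs they change, so a list of colliding keys prepared in advance is much less likely to collide.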

Unfortunately, the story doesn't end there. The algorithm used to randomise hash tables didn't quite work. Too few effective seeds ended up being used, making it possible to test for which seed was current. issue14621 was opened to report this, and a presentation on revisiting hash-flooding DoS attacks was given later that year. While this presentation focused on attacking the MurmurHash and CityHash64 hash functions, proof of concept code was shown for attacking Python's new hash table randomisation.

In addition to demonstrating weaknesses in existing hash functions for hash table randomisation, the presenters announced a new algorithm called SipHash. SipHash is a cryptographic hash function. Unlike the hash functions above, it should be practically impossible to determine the input from the output of a cryptographic hash function. This mitigates the types of attacks previously demonstrated.

Other issues with the Python implementation also exist. The algorithm performs poorly, and the implementation is difficult to replace with alternatives more appropriate for embedded systems. To discuss these issues and possible solutions, Christian Heimes drafted PEP456. It's currently working towards acceptance.

Christian's early tests of SipHash in Python show positive results on both x86 and x64. Hopefully PEP456 will achieve acceptance in time for Python 3.4's release in February next year. Then this story can come to an end.


Rebooting TwoSlug

This week, like last week, I've been revamping an old project and learning some new tools. This time it's been TwoSlug receiving my attention.

TwoSlug is a fun little web tool to generate random two-word phrases, or slug lines, using a randomly chosen verb and noun. Words are chosen from Princeton University's WordNet database. It was originally written almost a year ago, in a single evening, and hosted on GitHub. Unfortunately, it's been limited to a simple, static HTML page until now.
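The core idea fits in a few lines. This is a rough sketch rather than TwoSlug's actual code: the random_slug function is hypothetical, and the short hardcoded word lists are placeholders standing in for WordNet.

```python
import random

# Placeholder word lists; the real tool draws its words from WordNet.
VERBS = ["reboot", "compile", "hash", "deploy"]
NOUNS = ["slug", "table", "maverick", "node"]


def random_slug(rng=random):
    """Pick one verb and one noun at random and join them into a slug line."""
    return f"{rng.choice(VERBS)} {rng.choice(NOUNS)}"


print(random_slug())  # a random verb followed by a random noun
```

Passing in a seeded random.Random instance as rng makes the output repeatable, which is handy for testing.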

Aside from the obligatory theming, I've added two major new features to TwoSlug. The first is an API for requesting your own random slug lines. With an HTTP GET request you can ask for any combination of verbs, nouns, adjectives and adverbs. The API will return a random slug line in JSON format.

The second feature is word definitions using DuckDuckGo's excellent Instant Answer API. Definitions are displayed as pop-ups on each word. The words themselves are links to the original definition source. The team at DuckDuckGo are extremely generous to offer their API with very few restrictions.

Now, if I can just get ElasticSearch to use TwoSlug to generate a random node name each time a node starts.