Preparing for Mavericks

Apple has released OS X 10.9, also known as Mavericks, which means it's time to prepare for an operating system upgrade again. In the interest of soliciting feedback, I'm sharing the rough process I'm using for this upgrade.

  • Update all software. In addition to installing all pending App Store updates, I make sure any applications with their own auto-updaters are started and check for new updates. In particular, this means running Mail to ensure GPGTools updates GPGMail.
  • Back up all important data. Most of my important but non-confidential data is already stored remotely on GitHub and Dropbox. I'm in the process of setting up Arq to back up the remainder.
  • Update and upgrade Homebrew. Rather than just upgrading, for this release I've decided to completely uninstall and reinstall Homebrew, following the uninstall instructions in the Homebrew FAQ and the commands in this Gist.
  • Compile a list of post-upgrade procedures. Currently, the only procedure I've recorded is to use xcode-select to install the latest command line tools.

It's hardly a detailed or rigorous plan. If you do have feedback, or think anything is missing, please leave a comment below. Alternatively, there is also Twitter or email.


What not to do with ZeroMQ

ZeroMQ is a great networking library, and the PyZMQ package makes that greatness accessible from Python. This week, however, I encountered an implementation pattern that is incompatible with ZeroMQ.

For "reasons", I had wanted to use ZeroMQ inside a process in a way that was blind to process forks. Unfortunately, if a child interacts with a ZeroMQ context inherited from its parent in anyway, including attempting to close it, ZeroMQ will likely terminate with an assertion failure. Compounding this, not being able to close the context means leaking file descriptors. The worst case scenario is a child that does some work then forks, the parent exits while the child repeats the sequence.

Here is a silly example that exercises this worst case.

import os
import sys
import zmq

ADDRESS = 'tcp://127.0.0.1:5555'
MAX_FORKS = 4096

# the original parent creates a zeromq context and socket
ctx = zmq.Context()
sock = ctx.socket(zmq.PULL)
sock.bind(ADDRESS)

forked = 0
if os.fork() != 0:
    # this parent listens for messages from its children
    while forked < MAX_FORKS:
        forked = sock.recv_json()
        sys.stdout.write('.')
        sys.stdout.flush()
    sys.exit(0)
else:
    # the children discard the parent's context,
    # open a zeromq socket and send a message
    # to the first parent
    while forked < MAX_FORKS:
        forked += 1
        del ctx, sock
        ctx = zmq.Context()
        sock = ctx.socket(zmq.PUSH)
        sock.connect(ADDRESS)
        sock.send_json(forked)
        # after sending a message, this child exits
        # spawning a new child to send a new message
        if os.fork() != 0:
            sys.exit(0)

Honestly, this is a terrible design when using ZeroMQ, as evidenced by its inevitable failure once it runs out of file descriptors. Sadly, I also have a valid reason for wanting to be robust while using this design. So my choices include:

  1. Do not support networking in children. (A valid but regrettable limitation)
  2. Apply a different networking solution. (Yak shaving)

Even if an atfork module that mirrored the existing atexit were added to Python, it might not resolve this issue for me. The proposed atfork implementation is Python-only, so under CPython it would be oblivious to any forks made from inside C extensions.

ZeroMQ is still great. It's just not intended to work in this situation.
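For contrast, here is a minimal sketch (mine, not from the post, using a made-up port) of the pattern ZeroMQ does expect: fork first, then create a context in each process, so no ZeroMQ state ever crosses the fork boundary.

import os
import zmq

ADDRESS = 'tcp://127.0.0.1:5556'  # hypothetical address for this sketch

pid = os.fork()
if pid != 0:
    # parent: creates its own context only after the fork
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PULL)
    sock.bind(ADDRESS)
    print(sock.recv_json())
    sock.close()
    ctx.term()
    os.waitpid(pid, 0)
else:
    # child: likewise creates its own context after the fork
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PUSH)
    sock.connect(ADDRESS)
    sock.send_json('hello from the child')
    sock.close()
    ctx.term()
    os._exit(0)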


Python hash table randomisation

This is an incomplete, abbreviated and under-researched history of hash table randomisation in Python. It's intended to be accessible and high level. For technical details, please use the references provided. A good place to start is: what is a hash table, and why would Python want to randomise it?

A hash table is the data structure used to implement Python dictionaries. A hash table can efficiently look up a value by mapping its key to a specific position. An important property of hash tables is how they handle collisions, where multiple keys map to the same position. The performance of a hash table is predicated on collisions being rare; when there are a large number of them, performance can degrade dramatically.
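As a rough illustration (my own sketch, not from the original post), a throwaway Collider class whose instances all hash to the same value forces every insertion to compare against the existing entries:

import time

class Collider(object):
    # every instance hashes to the same value, forcing collisions
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 42
    def __eq__(self, other):
        return self.n == other.n

def fill(keys):
    # time how long it takes to insert all the keys into a dict
    start = time.time()
    table = {}
    for key in keys:
        table[key] = None
    return time.time() - start

N = 2000
print('distinct hashes:  %.4fs' % fill(range(N)))
print('colliding hashes: %.4fs' % fill(Collider(i) for i in range(N)))

With only a couple of thousand keys, the colliding case should already be dramatically slower than the normal one.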

It's this degradation in performance that was the basis of a 2011 presentation on efficient denial of service attacks. Denial of service, or DoS, attacks on hash tables weren't exactly new, but practical attacks on real-world implementations were. Naturally, in addition to the security list, it was reported to python-dev and issue13703 was opened.

There are different solutions to this problem. Using a different collision resolution method is one. Providing an alternative data structure, such as a trie, is another (although I wouldn't use the one I wrote just yet). The generally accepted solution was to introduce some randomisation into the algorithm that maps keys to positions.

To achieve this randomisation, keys are passed through a hash function before values are placed or retrieved. The hash function is seeded with a random value when Python starts, which makes it more difficult for an attacker to predict collisions, hopefully so difficult as to render DoS attacks on the hash table impractical. Hash randomisation was enabled by default in Python 3.3.
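A quick way to see this in action (again my own sketch, not from the post) is to hash the same string in separate interpreter processes; with randomisation enabled, each run should print a different value:

import subprocess
import sys

# assumes hash randomisation is enabled (the default from Python 3.3)
# and PYTHONHASHSEED is not pinned to a fixed value in the environment
cmd = [sys.executable, '-c', 'print(hash("collide me"))']
for _ in range(3):
    print(subprocess.check_output(cmd).decode().strip())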

Unfortunately the story doesn't end there. The algorithm used to randomise hash tables didn't quite work: too few effective seeds ended up being used, making it possible to test for which seed was current. issue14621 was opened to report this, and a presentation revisiting hash-flooding DoS attacks was given later that year. While that presentation focused on attacking the MurmurHash and CityHash64 hash functions, proof-of-concept code was shown for attacking Python's new hash table randomisation.

In addition to demonstrating weaknesses in existing hash functions used for hash table randomisation, the presenters announced a new algorithm called SipHash. SipHash is a cryptographic hash function. Unlike the hash functions above, it should be practically impossible to determine the input from the output of a cryptographic hash function, which mitigates the types of attacks previously demonstrated.

Other issues with the Python implementation also exist. The algorithm performs poorly, and the implementation is difficult to replace with alternatives more appropriate for embedded systems. To discuss these issues and possible solutions, Christian Heimes drafted PEP 456. It's currently working towards acceptance.

Early test results from Christian of using SipHash in Python are positive for both x86 and x64. Hopefully PEP 456 will be accepted in time for Python 3.4's release in February next year. Then this story can come to an end.