InfiniteSortedObjectSequence – for large data sets in Python

What do you do when you have a data set too large to fit into RAM?

You could just use disk directly, so instead of a dict you use a shelve or bsddb, however the problem with that is that you then have a performance hit as all operations are disk based.

You could have a specialization of such a data structure that tries to use RAM however it spills data to disk when necessary.

I’ve been playing with implementing MapReduce in Python, and for the case when the data set doesn’t fit into RAM you need to stream a potentially large, unordered sequence of tuples and then sort them.

The basic use case is that there is any number of Append() calls, and then when you’re done you want the data sorted.

I wrote InfiniteSortedObjectSequence which will be used in my Pythonic MapReduce code that works on large data sets (TBD), however it’s useful in isolation, thus this note.

The code is http://code.google.com/p/tropo/source/browse/trunk/Python/tr_mapreduce/infinite.py. You’ll also need pickle_io and mergesort from that same directory.

The basic way it works is:

infinite = InfiniteSortedObjectSequence()

infinite.Append(...) # call this many times

sorted = infinite.Sort()
for tuple in sorted:
    ...

Behind the scenes when it spills to disk it sorts the “run”, then writes it to a pickled file. There are options to compress the data when it’s written to disk, which unfortunately takes longer but uses less space.

To sort it does an in-memory sort if the data set is small enough, else it does a merge sort using an N-way merge sort.

This seems to work well for a data set that is larger than RAM but which fits on a hard drive. More hard core industry strength solutions would probably not be in python :) and would try to utilize more disks, control the number of files open during a merge, and maybe use async i/o.

Note that this class is for a specialized use case that is somewhat normal in some aspects of IR and text processing: first you gather the data, then you want it sorted — there’s no random access lookup, length call, etc, just Append() and Sort().

Leave a Reply »»