MapReduce in 10 or so lines of Python

I’ve realized that I understand things best when I implement them myself, and I was recently reading Trevor Strohman’s dissertation, intriguied by TupleFlow, a kind of more elaborate and improved MapReduce, and was about to write my own toy impl of TupleFlow when I decided to simplify and just for fun write MapReduce in Python.

The goal of here is for a simple and short implementation, and with comments stripped out we have:

def MrSimple(producer, mapper, reducer, consumer):
    stage1 = []
    for n, v in producer():
      for n2, v2 in mapper(n, v):
        stage1.append((n2, v2))
    for n2, vals in itertools.groupby(sorted(stage1), lambda x: x[0]):
      seconds = (second[1] for second in vals)
      for v2 in reducer(n2, seconds):
        consumer(n2, v2)


..

producer is a generator that yields a series of name, value pairs – in the classic term frequency counting case it would return file,contents pairs.

mapper takes in name,value pairs and generates a series of name2,value2 pairs. In the word freq case it would emit (term,’1′) pairs for every word in ‘value2′.

reducer is called with (name, values) and emits ‘value3′ that are associated with the name.

consumer is used to persist the results of reducer.

This MapReduce runs in three stages:

  1. Run producer and mapper.
  2. Sort the name,value pairs the mapper returned.
  3. Run the reducer and consumer.

A compile of implementation notes are:

  • Ideally I would use the builtin map() however I think that would complicate the code.
  • I do use itertools.groupby() which is very handy.

A real implementation would use multiple threads, multiple processes, and be able to process data sets larger than fit into memory.

The core code is in mr_simple.py and a demonstration driver is mr_simple_demo.py. I’ve recently started storing any personal projects that I’m not totally embarrassed by in code.google.com BTW.

To follow – a variation that can work on larger data sets.

For a similar stab at it see this.

2 Responses to MapReduce in 10 or so lines of Python »»


Comments

  1. Comment by Amit Patel | 2009/04/26 at 10:12:38

    I think of MapReduce grouping the results as the mapper produces them:

    def MrSimpler(producer, mapper, reducer, consumer):
    … sharded = {}
    … for n, v in producer():
    … … for key, value in mapper(n, v):
    … … … sharded[key].setdefault([]).append(value)
    … for key, values in sharded:
    … … consumer(key, reducer(key, values))

    I’m not sure what the actual implementation does though.

  2. Comment by dave | 2009/04/26 at 22:39:25

    Yes, nice, thank you, this is much more logical, concise, crisp, and in the spirit of things.


Leave a Reply »»