I’ve realized that I understand things best when I implement them myself. I was recently reading Trevor Strohman’s dissertation and was intrigued by TupleFlow, a kind of more elaborate and improved MapReduce. I was about to write my own toy implementation of TupleFlow when I decided to simplify and, just for fun, write MapReduce in Python.
The goal here is a simple and short implementation; with comments stripped out we have:
import itertools

def MrSimple(producer, mapper, reducer, consumer):
    # Stage 1: run producer and mapper, collecting (name2, value2) pairs.
    stage1 = []
    for n, v in producer():
        for n2, v2 in mapper(n, v):
            stage1.append((n2, v2))
    # Stages 2 and 3: sort, group by name, then reduce and consume.
    for n2, vals in itertools.groupby(sorted(stage1), lambda x: x[0]):
        seconds = (second for first, second in vals)
        for v2 in reducer(n2, seconds):
            consumer(n2, v2)
producer is a generator that yields a series of (name, value) pairs – in the classic term-frequency counting case it would yield (file, contents) pairs.
mapper takes in (name, value) pairs and generates a series of (name2, value2) pairs. In the word-frequency case it would emit (term, '1') pairs for every word in value.
reducer is called with (name, values) and emits the value3 results associated with that name.
consumer is used to persist the results of reducer.
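To make the four roles concrete, here is the word-frequency case wired all the way through. The MrSimple body is repeated so the snippet runs on its own; DOCS and the results dict are invented stand-ins for real file I/O and persistence.

```python
import itertools

def MrSimple(producer, mapper, reducer, consumer):
    stage1 = []
    for n, v in producer():
        for n2, v2 in mapper(n, v):
            stage1.append((n2, v2))
    for n2, vals in itertools.groupby(sorted(stage1), lambda x: x[0]):
        for v2 in reducer(n2, (second for first, second in vals)):
            consumer(n2, v2)

# Hypothetical in-memory "files" in place of real I/O.
DOCS = {'a.txt': 'the cat sat', 'b.txt': 'the dog'}
results = {}

def producer():
    return DOCS.items()

def mapper(name, contents):
    # Emit a (term, '1') pair for every word in the document.
    for word in contents.split():
        yield word, '1'

def reducer(word, ones):
    # Count how many '1' markers arrived for this word.
    yield sum(1 for _ in ones)

def consumer(word, count):
    results[word] = count

MrSimple(producer, mapper, reducer, consumer)
# results == {'cat': 1, 'dog': 1, 'sat': 1, 'the': 2}
```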
This MapReduce runs in three stages:
- Run producer and mapper.
- Sort the name,value pairs the mapper returned.
- Run the reducer and consumer.
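Traced on a tiny invented input, the data moving between the stages might look like this:

```python
import itertools

# Stage 1 output: the mapper emitted one ('word', '1') pair per word.
stage1 = [('the', '1'), ('cat', '1'), ('the', '1')]

# Stage 2: sorting puts pairs with equal names next to each other.
ordered = sorted(stage1)  # [('cat', '1'), ('the', '1'), ('the', '1')]

# Stage 3: each run of equal names becomes one group for the reducer.
groups = [(name, [v for _, v in vals])
          for name, vals in itertools.groupby(ordered, lambda x: x[0])]
# groups == [('cat', ['1']), ('the', ['1', '1'])]
```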
A couple of implementation notes:
- Ideally I would use the builtin map(); however, I think that would complicate the code.
- I do use itertools.groupby() which is very handy.
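One subtlety worth noting about itertools.groupby(): it only merges adjacent equal keys, which is exactly why the pairs are sorted before grouping. A small sketch:

```python
import itertools

pairs = [('b', 1), ('a', 1), ('b', 1)]

# Unsorted input: 'b' ends up in two separate groups.
unsorted_keys = [k for k, _ in itertools.groupby(pairs, lambda x: x[0])]
# unsorted_keys == ['b', 'a', 'b']

# Sorted input: every name forms exactly one group.
sorted_keys = [k for k, _ in itertools.groupby(sorted(pairs), lambda x: x[0])]
# sorted_keys == ['a', 'b']
```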
A real implementation would use multiple threads, multiple processes, and be able to process data sets too large to fit in memory.
To follow – a variation that can work on larger data sets.
For a similar stab at it see this.