<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>I'm Sorry Dave - Dave Spencer's Weblog &#187; python</title>
	<atom:link href="http://www.chencer.com/dave/blog/category/code/python-code/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.chencer.com/dave/blog</link>
	<description>David Spencer's personal weblog</description>
	<lastBuildDate>Tue, 15 Feb 2011 06:41:56 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.5</generator>
		<item>
		<title>Python&#8217;s multiprocessing module is the new hottness</title>
		<link>http://www.chencer.com/dave/blog/2008/12/05/pythons-multiprocessing-module-is-the-new-hottness/</link>
		<comments>http://www.chencer.com/dave/blog/2008/12/05/pythons-multiprocessing-module-is-the-new-hottness/#comments</comments>
		<pubDate>Sat, 06 Dec 2008 05:38:15 +0000</pubDate>
		<dc:creator>dave</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[interesting]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[tech]]></category>

		<guid isPermaLink="false">http://www.tropo.com/dave/blog/2008/12/05/pythons-multiprocessing-module-is-the-new-hottness/</guid>
		<description><![CDATA[The new multiprocessing module in Python 2.6 and 3.0 looks pretty cool. It gets around the whole, um, design for low performance where the dreaded Global Interpreter Lock (GIL) makes multithreading difficult, by making it easy to spawn python subprocesses, communicate with them, and share data. They even have a form of security on the [...]]]></description>
			<content:encoded><![CDATA[<p>The new <a href="http://docs.python.org/library/multiprocessing.html">multiprocessing</a> module in Python 2.6 and 3.0 looks pretty cool. It gets around the whole, um, design for low performance, where the dreaded Global Interpreter Lock (GIL) keeps threads from running Python code in parallel, by making it easy to spawn Python subprocesses, communicate with them, and share data. It even has a form of security on its sockets, so bad guys can&#8217;t send data to any of the processes without knowing a secret key.</p>
<p>Note to self: explore writing a toy multi-core mapreduce with &#8216;import multiprocessing&#8217;&#8230;</p>
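<p>As a rough starting point for that, here is a minimal sketch of a multi-core word count. The names count_words, merge_counts and parallel_word_count are made up for illustration, the only multiprocessing feature it relies on is Pool.map, and using the pool as a context manager assumes Python 3.3 or later:</p>

```python
import multiprocessing
from collections import Counter


def count_words(text):
    """Map step: count the words in one chunk of text (runs in a worker)."""
    return Counter(text.split())


def merge_counts(partials):
    """Reduce step: add the per-chunk counters together in the parent."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def parallel_word_count(chunks, processes=None):
    """Fan the chunks out to a pool of worker processes, then merge."""
    # processes=None lets the pool pick one worker per available core.
    with multiprocessing.Pool(processes) as pool:
        partials = pool.map(count_words, chunks)
    return merge_counts(partials)
```

<p>Call parallel_word_count() from under an if __name__ == "__main__": guard, so that worker processes on platforms that spawn rather than fork don't repeat the call when they import the main module. The map step has to be a top-level function because it is pickled by name and looked up again inside each worker.</p>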
]]></content:encoded>
			<wfw:commentRss>http://www.chencer.com/dave/blog/2008/12/05/pythons-multiprocessing-module-is-the-new-hottness/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>InfiniteSortedObjectSequence &#8211; for large data sets in Python</title>
		<link>http://www.chencer.com/dave/blog/2008/07/13/infinitesortedobjectsequence-for-large-data-sets-in-python/</link>
		<comments>http://www.chencer.com/dave/blog/2008/07/13/infinitesortedobjectsequence-for-large-data-sets-in-python/#comments</comments>
		<pubDate>Mon, 14 Jul 2008 06:34:54 +0000</pubDate>
		<dc:creator>dave</dc:creator>
				<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.tropo.com/dave/blog/2008/07/13/infinitesortedobjectsequence-for-large-data-sets-in-python/</guid>
		<description><![CDATA[What do you do when you have a data set too large to fit into RAM? You could just use disk directly, so instead of a dict you use a shelve or bsddb, however the problem with that is that you then have a performance hit as all operations are disk based. You could have [...]]]></description>
			<content:encoded><![CDATA[<p>What do you do when you have a data set too large to fit into RAM?</p>
<p>You could just use disk directly, so instead of a dict you use a shelve or bsddb. The problem is that you then take a performance hit, as all operations are disk based.</p>
<p>Or you could have a specialization of such a data structure that uses RAM while it can and spills data to disk only when necessary.</p>
<p>I&#8217;ve been playing with <a href="http://www.chencer.com/dave/blog/2008/07/09/mapreduce-in-10-or-so-lines-of-python/">implementing MapReduce in Python</a>, and for the case when the data set doesn&#8217;t fit into RAM you need to stream a potentially large, unordered sequence of tuples and then sort them.</p>
<p>The basic use case is that there are any number of Append() calls, and then when you&#8217;re done you want the data sorted.</p>
<p>I wrote InfiniteSortedObjectSequence for use in my Pythonic MapReduce code that works on large data sets (TBD). It&#8217;s useful in isolation too, thus this note.</p>
<p>The code is <a href="http://code.google.com/p/tropo/source/browse/trunk/Python/tr_mapreduce/infinite.py">http://code.google.com/p/tropo/source/browse/trunk/Python/tr_mapreduce/infinite.py</a>. You&#8217;ll also need pickle_io and mergesort from that same directory.</p>
<p>The basic way it works is:</p>
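<pre><code># InfiniteSortedObjectSequence is defined in infinite.py (linked above).
infinite = InfiniteSortedObjectSequence()

infinite.Append(...)  # call this many times

# Named sorted_items and item so the builtins sorted and tuple aren't shadowed.
sorted_items = infinite.Sort()
for item in sorted_items:
    ...
</code></pre>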
<p>Behind the scenes when it spills to disk it sorts the &#8220;run&#8221;, then writes it to a pickled file. There are options to compress the data when it&#8217;s written to disk, which unfortunately takes longer but uses less space.</p>
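<p>To make the spill step concrete, here is a minimal sketch of writing one sorted run to a pickled file and streaming it back. It is not the code in infinite.py: write_run, read_run and the compress flag are hypothetical names, and gzip stands in for whatever compression the real class offers.</p>

```python
import gzip
import os
import pickle
import tempfile


def write_run(items, compress=False):
    """Sort one in-memory run, pickle it item by item, and return the file path."""
    fd, path = tempfile.mkstemp(suffix=".run")
    os.close(fd)  # reopen below with the opener that matches the compress flag
    opener = gzip.open if compress else open
    with opener(path, "wb") as out:
        for item in sorted(items):
            pickle.dump(item, out, protocol=pickle.HIGHEST_PROTOCOL)
    return path


def read_run(path, compress=False):
    """Yield the items of a run back in sorted order, one at a time."""
    opener = gzip.open if compress else open
    with opener(path, "rb") as src:
        while True:
            try:
                yield pickle.load(src)
            except EOFError:
                return  # end of file: every pickled item has been read
```

<p>Pickling item by item rather than as one big list means a run can be read back lazily, which is what a later merge needs.</p>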
<p>To sort, it does an in-memory sort if the data set is small enough; otherwise it merges the sorted runs on disk using an <a href="http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/511509">N-way merge</a>.</p>
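<p>The two sorting paths can be sketched with heapq.merge from the standard library (Python 2.6 and later), swapped in here for the linked recipe. sort_all and its arguments are hypothetical names, and each spilled run is assumed to be an already-sorted iterable:</p>

```python
import heapq


def sort_all(in_memory_items, spilled_runs):
    """Return an iterator over everything in sorted order.

    in_memory_items: items still held in RAM, in any order.
    spilled_runs: iterables that each yield one already-sorted run.
    """
    last_run = sorted(in_memory_items)
    if not spilled_runs:
        # Nothing was spilled, so a plain in-memory sort is enough.
        return iter(last_run)
    # N-way merge: heapq.merge pulls lazily from every run, so only one
    # pending item per run is held in memory at a time.
    return heapq.merge(last_run, *spilled_runs)
```

<p>Because the merge is lazy, the caller can stream through the result without the whole data set ever being in memory at once.</p>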
<p>This seems to work well for a data set that is larger than RAM but fits on a hard drive. More hard-core, industrial-strength solutions would probably not be in Python <img src='http://www.chencer.com/dave/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> and would try to utilize more disks, control the number of files open during a merge, and maybe use async I/O.</p>
<p>Note that this class is for a specialized use case that is fairly common in IR and text processing: first you gather the data, then you want it sorted &#8212; there&#8217;s no random access lookup, length call, etc., just Append() and Sort().</p>
]]></content:encoded>
			<wfw:commentRss>http://www.chencer.com/dave/blog/2008/07/13/infinitesortedobjectsequence-for-large-data-sets-in-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MapReduce in 10 or so lines of Python</title>
		<link>http://www.chencer.com/dave/blog/2008/07/09/mapreduce-in-10-or-so-lines-of-python/</link>
		<comments>http://www.chencer.com/dave/blog/2008/07/09/mapreduce-in-10-or-so-lines-of-python/#comments</comments>
		<pubDate>Thu, 10 Jul 2008 06:01:57 +0000</pubDate>
		<dc:creator>dave</dc:creator>
				<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.tropo.com/dave/blog/2008/07/09/mapreduce-in-10-or-so-lines-of-python/</guid>
		<description><![CDATA[I&#8217;ve realized that I understand things best when I implement them myself, and I was recently reading Trevor Strohman&#8217;s dissertation, intrigued by TupleFlow, a kind of more elaborate and improved MapReduce, and was about to write my own toy impl of TupleFlow when I decided to simplify and just for fun write MapReduce in Python. [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve realized that I understand things best when I implement them myself. I was recently reading <a href="http://ciir.cs.umass.edu/~strohman/">Trevor Strohman&#8217;s</a> <a href="http://ciir.cs.umass.edu/~strohman/dissertation/">dissertation</a>, intrigued by TupleFlow, a kind of more elaborate and improved <a href="http://labs.google.com/papers/mapreduce.html">MapReduce</a>, and was about to write my own toy implementation of TupleFlow when I decided to simplify and, just for fun, write MapReduce in Python.</p>
<p>The goal here is a simple, short implementation. With comments stripped out we have:</p>
<pre><code>import itertools
from operator import itemgetter

def MrSimple(producer, mapper, reducer, consumer):
    stage1 = []
    for n, v in producer():
        for n2, v2 in mapper(n, v):
            stage1.append((n2, v2))
    stage1.sort(key=itemgetter(0))
    for n2, vals in itertools.groupby(stage1, key=itemgetter(0)):
        seconds = (second[1] for second in vals)
        for v2 in reducer(n2, seconds):
            consumer(n2, v2)
</code></pre>
<p>producer is a generator that yields a series of name, value pairs &#8211; in the classic term frequency counting case it would yield (file, contents) pairs.</p>
<p>mapper takes in name, value pairs and generates a series of name2, value2 pairs. In the word frequency case it would emit a (term, &#8216;1&#8217;) pair for every word in &#8216;value&#8217;.</p>
<p>reducer is called with (name2, values), where values holds every value2 the mapper emitted for that name, and emits the &#8216;value3&#8217; results associated with the name.</p>
<p>consumer is used to persist the results of reducer.</p>
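<p>To make the four roles concrete, here is a small word count wired together. It is only a sketch: mr_simple restates the function above so the example runs on its own, the two in-memory &#8220;files&#8221; are made up, the mapper emits the integer 1 rather than the string &#8216;1&#8217; so the reducer can sum directly, and the consumer just stores results in a dict.</p>

```python
import itertools
from operator import itemgetter


def mr_simple(producer, mapper, reducer, consumer):
    """Restatement of the post's MrSimple so this example is self-contained."""
    stage1 = []
    for name, value in producer():
        stage1.extend(mapper(name, value))
    stage1.sort(key=itemgetter(0))
    for name2, group in itertools.groupby(stage1, key=itemgetter(0)):
        for value3 in reducer(name2, (pair[1] for pair in group)):
            consumer(name2, value3)


def producer():
    # Hypothetical in-memory "files"; a real producer would read from disk.
    yield "a.txt", "the cat sat"
    yield "b.txt", "the dog sat down"


def mapper(name, contents):
    # Emit (term, 1) for every word in the file's contents.
    for term in contents.split():
        yield term, 1


def reducer(term, counts):
    # One output value per term: the total number of occurrences.
    yield sum(counts)


results = {}


def consumer(term, total):
    # "Persist" the reducer output by keeping it in a dict.
    results[term] = total


mr_simple(producer, mapper, reducer, consumer)
```

<p>After the call, results maps each term to the number of times it appeared across both inputs.</p>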
<p>This MapReduce runs in three stages:</p>
<ol>
<li>Run producer and mapper.</li>
<li>Sort the name,value pairs the mapper returned.</li>
<li>Run the reducer and consumer.</li>
</ol>
<p>A couple of implementation notes:</p>
<ul>
<li>Ideally I would use the builtin map(), but I think that would complicate the code.</li>
<li>I do use itertools.groupby(), which is very handy.</li>
</ul>
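<p>One thing worth knowing about itertools.groupby() is that it only groups adjacent items, which is why the pairs are sorted before grouping. A small demonstration, with made-up data and a hypothetical group_sizes helper:</p>

```python
import itertools
from operator import itemgetter

pairs = [("b", 1), ("a", 1), ("b", 1)]


def group_sizes(items):
    """Return (key, number of items) for each run of equal keys."""
    return [(key, len(list(group)))
            for key, group in itertools.groupby(items, key=itemgetter(0))]


# Unsorted input: "b" shows up as two separate groups.
unsorted_groups = group_sizes(pairs)

# Sorted input: each key forms exactly one group.
sorted_groups = group_sizes(sorted(pairs))
```

<p>Skipping the sort would mean the reducer gets called more than once for the same name, each time with only part of its values.</p>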
<p>A real implementation would use multiple threads, multiple processes, and be able to process data sets larger than fit into memory.</p>
<p>The core code is in <a href="http://code.google.com/p/tropo/source/browse/trunk/Python/tr_mapreduce/mr_simple.py">mr_simple.py</a> and a demonstration driver is in <a href="http://code.google.com/p/tropo/source/browse/trunk/Python/tr_mapreduce/mr_simple_demo.py">mr_simple_demo.py</a>. I&#8217;ve recently started storing any personal projects that I&#8217;m not totally embarrassed by on <a href="http://code.google.com/p/tropo">code.google.com</a>, BTW.</p>
<p>To follow &#8211; a variation that can work on larger data sets.</p>
<p>For a similar stab at it see <a href="http://outgoing.typepad.com/outgoing/2005/04/mapreduce.html">this</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.chencer.com/dave/blog/2008/07/09/mapreduce-in-10-or-so-lines-of-python/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

