Announcing the Map/Reduce Toolkit - Open Blog - NYTimes.com
To illustrate how simple it can be, here’s an actual program that counts the browsing requests from each IP address. This is really all there is to it!
"... Such projects have required special knowledge and expertise. The Map/Reduce Toolkit (MRToolkit) aims to change this. It takes care of the details of setting up and running Hadoop jobs, and encapsulates most of the complexity of writing map and reduce steps. The toolkit, which is Ruby-based, provides the framework — you only have to supply the details of the map and reduce steps."
Package for making it easier to use mapreduce for batch processing, from NYTimes.
Easy Map-Reduce With Hadoop Streaming - igvita.com
If you're considering doing large scale analysis of structured data (access logs, for example), there are dozens of enterprise-level solutions ranging from specialized streaming databases, to the more mundane data warehousing solutions with star topologies and column store semantics. Google, facing the same problem, developed a system called Sawzall, which leverages their existing Map-Reduce clusters for large scale parallel data analysis by adding a DSL for easy manipulation of data.
Map/Reduce Toolkit by NY Times engineers is a great example of a Ruby DSL on top of the Hadoop Streaming interface. Specifically aimed at simplifying their internal log processing jobs, it exposes just the necessary bits for handling the access log inputs and provides a number of predefined reduce steps: unique, counter, etc. For example, to get a list of all unique visitor IPs, the entire program consists of:
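The MRToolkit listing is not reproduced in this excerpt. For a sense of what the map and reduce steps do underneath, here is a minimal sketch in plain Ruby of the same idea written directly against the Hadoop Streaming line protocol (tab-separated key/value lines). This is not the MRToolkit DSL: the method names `map_lines`, `reduce_lines` and `run_local` are hypothetical, and the sketch assumes the client IP is the first whitespace-separated field of each log line.

```ruby
# Sketch of the two Hadoop Streaming steps in plain Ruby (NOT the MRToolkit DSL).
# Assumption: each access-log line starts with the client IP address.

# Map step: emit "ip<TAB>1" for every log line that has a first field.
def map_lines(lines)
  lines.map do |line|
    ip = line.split(' ', 2).first
    ip.nil? ? nil : "#{ip}\t1"
  end.compact
end

# Reduce step: input arrives sorted by key, so equal IPs are adjacent.
# Emit each IP exactly once, together with its request count.
def reduce_lines(sorted_lines)
  sorted_lines
    .map { |l| l.chomp.split("\t", 2) }
    .chunk_while { |a, b| a[0] == b[0] }
    .map { |group| "#{group[0][0]}\t#{group.sum { |_, n| n.to_i }}" }
end

# Local stand-in for the sort that Hadoop performs between the two steps.
def run_local(lines)
  reduce_lines(map_lines(lines).sort)
end
```

The keys of the output are the unique visitor IPs, and the values are the per-IP request counts. In a real streaming job the two steps would typically live in separate scripts that read STDIN and write STDOUT, passed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options.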