mapreduce

Pages tagged mapreduce:

Collaborative Map-Reduce in the Browser - igvita.com
http://www.igvita.com/2009/03/03/collaborative-map-reduce-in-the-browser/

After several iterations, false starts, and great conversations with Michael Nielsen, a flash of the obvious came: HTTP + Javascript! What if you could contribute to a computational (Map-Reduce) job by simply pointing your browser to a URL? Surely your social network wouldn't mind opening a background tab to help you crunch a dataset or two!
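The coordination idea in that excerpt can be sketched in a few lines. The original post uses HTTP and JavaScript in the browser; the version below is only an in-process Python simulation, and `JobCoordinator`, `volunteer_map`, and the word-count job are hypothetical names chosen for illustration, not anything taken from the post.

```python
from collections import deque


class JobCoordinator:
    """Hands out map tasks and folds returned partial results into a total."""

    def __init__(self, chunks):
        self._pending = deque(enumerate(chunks))  # tasks not yet handed out
        self._outstanding = {}                    # tasks handed out, not yet returned
        self.total = 0

    def next_task(self):
        """Return (task_id, chunk) for the next volunteer, or None if nothing is pending."""
        if not self._pending:
            return None
        task_id, chunk = self._pending.popleft()
        self._outstanding[task_id] = chunk
        return task_id, chunk

    def submit(self, task_id, partial):
        """Reduce step: accept one partial result, ignoring unknown or repeated ids."""
        if self._outstanding.pop(task_id, None) is not None:
            self.total += partial

    def finished(self):
        return not self._pending and not self._outstanding


def volunteer_map(chunk):
    """Map step a volunteer would run: count the words in its chunk."""
    return sum(len(line.split()) for line in chunk)


# Simulate three volunteers fetching work and posting results back.
coordinator = JobCoordinator([["a b c", "d e"], ["f"], ["g h"]])
task = coordinator.next_task()
while task is not None:
    task_id, chunk = task
    coordinator.submit(task_id, volunteer_map(chunk))
    task = coordinator.next_task()
```

In a real deployment `next_task` and `submit` would sit behind URLs that each browser polls, and the coordinator would also need to reissue tasks that a volunteer never returns.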

Amazon Elastic MapReduce
http://aws.amazon.com/elasticmapreduce/

There's a growing trend to provide some pretty awesome IT services over the internet. Seems to me that's the way it will mostly be in a decade's time - or less.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

Who needs infrastructure? Keep your data somewhere else. Process your data somewhere else. You can now run your small data business out of your garage. Just photoshop a nice office for the investors.

Cloudera's Basic Hadoop Training | Cloudera
http://www.cloudera.com/hadoop-training-basic

Cloudera's Basic Hadoop Training is available online, free of charge. If you have questions about the content, please feel free to direct them to community support. Note: The activities and tutorials suggest downloading our virtual machine (VM). They all use the same VM, so if you download it once, there is no need to do so again.

Trying out Amazon Elastic MapReduce - moratorium
http://kzk9.net/blog/2009/04/reviewing_amazon_elastic_map_reduce.html

Announcing the Map/Reduce Toolkit - Open Blog - NYTimes.com
http://open.blogs.nytimes.com/2009/05/11/announcing-the-mapreduce-toolkit/

To illustrate how simple it can be, here’s an actual program that counts the browsing requests from each IP address. This is really all there is to it!

"... Such projects have required special knowledge and expertise. The Map/Reduce Toolkit (MRToolkit) aims to change this. It takes care of the details of setting up and running Hadoop jobs, and encapsulates most of the complexity of writing map and reduce steps. The toolkit, which is Ruby-based, provides the framework — you only have to supply the details of the map and reduce steps."

Package for making it easier to use mapreduce for batch processing, from NYTimes.
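The MRToolkit program mentioned in the excerpt is Ruby and is not reproduced on this page. As a rough stand-in, here is a plain Python sketch of the same job (requests per IP address). It assumes the IP is the first whitespace-separated field of each log line; the function names and sample lines are made up for illustration.

```python
from collections import defaultdict


def map_line(line):
    """Map step: emit (ip, 1) for one access-log line."""
    fields = line.split()
    if fields:  # skip blank lines
        yield fields[0], 1


def reduce_counts(ip, values):
    """Reduce step: sum the 1s emitted for a single IP."""
    return ip, sum(values)


def count_requests(lines):
    """Run map, group by key (the 'shuffle'), then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            groups[key].append(value)
    return dict(reduce_counts(ip, values) for ip, values in groups.items())


# Hypothetical sample input.
SAMPLE_LOG = [
    '10.0.0.1 - - [11/May/2009:10:00:00] "GET / HTTP/1.1" 200',
    '10.0.0.2 - - [11/May/2009:10:00:01] "GET /about HTTP/1.1" 200',
    '10.0.0.1 - - [11/May/2009:10:00:02] "GET /news HTTP/1.1" 200',
]
counts = count_requests(SAMPLE_LOG)
```

A framework such as Hadoop performs the grouping step across machines; here it is a single dictionary.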

mrtoolkit - Google Code
http://code.google.com/p/mrtoolkit/

MRToolkit provides a framework for building simple Map/Reduce jobs in just a few lines of code.

Map/Reduce Jobs for Hadoop in Ruby

MRToolkit provides a framework for building simple Map/Reduce jobs in just a few lines of code. You provide only the map and reduce logic, the framework does the rest. Or use one of the provided map or reduce tools, and write even less.

Wrapper around Hadoop's Map/Reduce for easier writing of jobs.

A Ruby wrapper for Hadoop.

erikfrey's bashreduce at master - GitHub
http://github.com/erikfrey/bashreduce/tree/master

whoah, wtf.

Map/Reduce in a bash script... hahahahahahaha

MapReduce done in BASH! Awesome!

Some mad bash magic for distributing stuff.

interesting hack -- apply Map-Reduce idioms to UNIX command lines across multiple machines or cores (via jzawodny, who's obviously looking at a lot of command line stuff recently ;)

bashreduce: A Bare-Bones MapReduce | Linux Magazine
http://www.linux-mag.com/cache/7407/1.html

heh. maybe useful for learning the mapreduce paradigm?

Facebook, Hadoop, and Hive | DBMS2 -- DataBase Management System Services
http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/

Just wanted to add that even though there is a single point of failure the reliability due to software bugs has not been an issue and the dfs Namenode has been very stable. The Jobtracker crashes that we have seen are due to errant jobs - job isolation is not yet that great in hadoop and a bad query from a user can bring down the tracker (though the recovery time for the tracker is literally a few minutes). There is some good work happening in the community though to address those issues.

A few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of Cloudera, in which he discussed a Hadoop-based effort at Facebook he previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon and in a couple of instances correct what Jeff had said. They also filled me in on Hive, a data-manipulation add-on to Hadoop that they developed and subsequently open-sourced.

HadoopDB Project
http://db.cs.yale.edu/hadoopdb/hadoopdb.html

An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.

HadoopDB is:
1. A hybrid of DBMS and MapReduce technologies that targets analytical workloads
2. Designed to run on a shared-nothing cluster of commodity machines, or in the cloud
3. An attempt to fill the gap in the market for a free and open source parallel DBMS
4. Much more scalable than currently available parallel database systems and DBMS/MapReduce hybrid systems
5. As scalable as Hadoop, while achieving superior performance on structured data analysis workloads

DBMS Musings: Announcing release of HadoopDB (longer version)
http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html

my students Azza Abouzeid and Kamil Bajda-Pawlikowski developed HadoopDB. It's an open source stack that includes PostgreSQL, Hadoop, and Hive, along with some glue between PostgreSQL and Hadoop, a catalog, a data loader, and an interface that accepts queries in MapReduce or SQL and generates query plans that are processed partly in Hadoop and partly in different PostgreSQL instances spread across many nodes in a shared-nothing cluster of machines. In essence it is a hybrid of MapReduce and parallel DBMS technologies. But unlike Aster Data, Greenplum, Pig, and Hive, it is not a hybrid simply at the language/interface level. It is a hybrid at a deeper, systems implementation level. Also unlike Aster Data and Greenplum, it is free and open source.
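To make the "processed partly in Hadoop and partly in different PostgreSQL instances" idea concrete, here is a toy Python sketch. It is not HadoopDB code: in-memory SQLite databases stand in for the per-node PostgreSQL instances, a plain function stands in for the MapReduce merge, and the `visits` table and all function names are assumptions made for illustration.

```python
import sqlite3
from collections import Counter


def make_node(rows):
    """Create one 'node': an in-memory SQLite database holding a slice of the data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (url TEXT, hits INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return conn


def local_aggregate(conn):
    """Work pushed down to the node's database: aggregate locally with SQL."""
    return conn.execute(
        "SELECT url, SUM(hits) FROM visits GROUP BY url"
    ).fetchall()


def merge_partials(partials):
    """Reduce-style step: combine the per-node partial sums into global totals."""
    totals = Counter()
    for rows in partials:
        for url, hits in rows:
            totals[url] += hits
    return dict(totals)


# Two hypothetical nodes, each holding part of the table.
nodes = [
    make_node([("/a", 3), ("/b", 1)]),
    make_node([("/a", 2), ("/c", 5)]),
]
totals = merge_partials(local_aggregate(node) for node in nodes)
```

The point of the split is that each database does the heavy, structured part of the query on its own slice, so only small partial results cross the network.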

Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2 | Cloudera
http://www.cloudera.com/hadoop-data-intensive-application-tutorial

This tutorial will show you how to use Amazon EC2 and Cloudera's Distribution for Hadoop to run batch jobs for a data intensive web application. During the tutorial, we will perform the following data processing steps:

* Configure and launch a Hadoop cluster on Amazon EC2 using the Cloudera tools
* Load Wikipedia log data into Hadoop from Amazon Elastic Block Store (EBS) snapshots and Amazon S3
* Run simple Pig and Hive commands on the log data
* Write a MapReduce job to clean the raw data and aggregate it to a daily level (page_title, date, count)
* Write a Hive query that finds trending Wikipedia articles by calling a custom mapper script
* Join the trend data in Hive with a table of Wikipedia page IDs
* Export the trend query results to S3 as a tab delimited text file for use in our web application's MySQL database
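The "aggregate it to a daily level (page_title, date, count)" step can be sketched as follows. The tutorial's actual input format is not given here, so the tab-separated `title, timestamp, hits` layout below is an assumption, as are the function names and sample records.

```python
from collections import Counter
from datetime import datetime


def map_record(line):
    """Map step: turn one raw record into ((page_title, date), hits).

    Assumes a hypothetical layout: title <TAB> 'YYYY-MM-DD HH:MM' <TAB> hits.
    """
    title, timestamp, hits = line.split("\t")
    day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M").date().isoformat()
    return (title, day), int(hits)


def aggregate_daily(lines):
    """Reduce step: sum hits per (page_title, date) and return sorted rows."""
    totals = Counter()
    for line in lines:
        key, hits = map_record(line)
        totals[key] += hits
    return sorted((title, day, count) for (title, day), count in totals.items())


# Hypothetical hourly records.
RAW = [
    "Hadoop\t2009-08-01 10:00\t3",
    "Hadoop\t2009-08-01 11:00\t4",
    "Hadoop\t2009-08-02 09:00\t1",
    "MapReduce\t2009-08-01 10:00\t2",
]
daily = aggregate_daily(RAW)
```

Each output row matches the (page_title, date, count) shape the list describes, ready to be written out as tab-delimited text.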

Gearman
http://gearman.org/

# Reverse Worker Code
$worker = new GearmanWorker();
$worker->addServer();
$worker->addFunction("reverse", "my_reverse_function");
while ($worker->work());

function my_reverse_function($job)
{
    return strrev($job->workload());
}

Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work. It allows you to do work in parallel, to load balance processing, and to call functions between languages. It can be used in a variety of applications, from high-availability web sites to the transport of database replication events. In other words, it is the nervous system for how distributed processing communicates.

language independent worker framework
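The pattern behind the worker snippet (register a named function, then let a server route jobs to it) can be shown without a Gearman installation. The sketch below is not the Gearman API: `ToyJobServer` is a hypothetical in-process stand-in built on a thread pool, and only the `reverse` job mirrors the excerpt.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyJobServer:
    """Routes named jobs to registered functions running on a worker pool."""

    def __init__(self, max_workers=4):
        self._functions = {}
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def add_function(self, name, func):
        """Register a worker function under a job name."""
        self._functions[name] = func

    def submit(self, name, workload):
        """Queue a job; returns a Future whose result is the worker's return value."""
        return self._pool.submit(self._functions[name], workload)

    def shutdown(self):
        self._pool.shutdown(wait=True)


def my_reverse_function(workload):
    """Same job as the excerpt's worker: reverse the workload string."""
    return workload[::-1]


server = ToyJobServer()
server.add_function("reverse", my_reverse_function)
futures = [server.submit("reverse", text) for text in ["hello", "gearman"]]
results = [future.result() for future in futures]
server.shutdown()
```

What a real job server adds on top of this is a network protocol, so that clients and workers can live on different machines and be written in different languages.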

Last.fm – the Blog · Mapreduce Bash Script
http://blog.last.fm/2009/04/06/mapreduce-bash-script

One night at the pub we discussed whether one could replace Hadoop (a massive and comprehensive implementation of Mapreduce) with a single bash script, an awk command, sort, and a sprinkling of netcat. This turned into a weekend project dubbed bashreduce.

Hardcoded version of push

Map-Reduce implemented as a bash script!

MapReduce in a Bash Script

Map Reduce implemented in bash using sort, awk, grep, join.
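For readers who want the shape of the idea without reading the shell script, here is a loose Python sketch of a partition, sort, and merge pipeline. It is not a port of bashreduce: the lists below stand in for remote hosts reached over netcat, and the function names and the choice of CRC32 for partitioning are assumptions.

```python
import heapq
import zlib
from itertools import groupby


def partition(lines, n_workers):
    """Send each line to a worker chosen by a stable hash of its first field."""
    buckets = [[] for _ in range(n_workers)]
    for line in lines:
        if line.strip():
            key = line.split()[0]
            buckets[zlib.crc32(key.encode()) % n_workers].append(line)
    return buckets


def worker(lines):
    """Per-worker step, roughly `sort | uniq -c` on the first field."""
    keys = sorted(line.split()[0] for line in lines)
    return [(key, sum(1 for _ in group)) for key, group in groupby(keys)]


def bash_reduce_sketch(lines, n_workers=3):
    """Partition, process each partition, then merge the sorted partial outputs."""
    partials = [worker(bucket) for bucket in partition(lines, n_workers)]
    return list(heapq.merge(*partials))


merged = bash_reduce_sketch(["b x", "a y", "b z", "c"])
```

Because every occurrence of a key lands on the same worker, each worker's counts are already final, and the last step only has to merge sorted streams.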

Riak - A Decentralized Database
http://riak.basho.com/

Riak combines a decentralized key-value store, a flexible map/reduce engine, and a friendly HTTP/JSON query interface to provide a database ideally suited for Web applications.

MapReduce programming with Apache Hadoop - Java World
http://www.javaworld.com/javaworld/jw-09-2008/jw-09-hadoop.html

hadoop

Google and its MapReduce framework may rule the roost when it comes to massive-scale data processing, but there's still plenty of that goodness to go around. This article gets you started with Hadoop, the open source MapReduce implementation for processing large data sets. Authors Ravi Shankar and Govindu Narendra first demonstrate the powerful combination of map and reduce in a simple Java program, then walk you through a more complex data-processing application based on Hadoop. Finally, they show you how to install and deploy your application in both standalone mode and clustering mode.
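The article works in Java; as a language-neutral preview of "the combination of map and reduce", the same two building blocks exist as Python built-ins. The word list here is made up for illustration.

```python
from functools import reduce

words = ["hadoop", "map", "reduce"]

# map: apply a function to every element independently.
lengths = list(map(len, words))

# reduce: fold the mapped values into a single result.
total_length = reduce(lambda acc, n: acc + n, lengths, 0)
```

Since each `map` call touches only its own element, that phase can be split across machines, which is what Hadoop does at scale.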

Easy distributed processing with Hadoop (Yahoo! JAPAN Tech Blog)
http://techblog.yahoo.co.jp/cat207/cat209/hadoop/

The Anatomy of Hadoop I/O Pipeline (Hadoop and Distributed Computing at Yahoo!)
http://developer.yahoo.net/blogs/hadoop/2009/08/the_anatomy_of_hadoop_io_pipel.html

Michael Nielsen » The Google Technology Stack
http://michaelnielsen.org/blog/lecture-course-the-google-technology-stack/

Interesting set of links and posts describing the technologies Google builds its software on, and how they work together.

The Google Technology Stack … or as I would put it: An Introduction to MapReduce, Data Mining and PageRank

A great in-depth treatment of the engine that powers Google

Part of what makes Google such an amazing engine of innovation is their internal technology stack: a set of powerful proprietary technologies that makes it easy for Google developers to generate and process enormous quantities of data. According to a senior Microsoft developer who moved to Google, Googlers work and think at a higher level of abstraction than do developers at many other companies, including Microsoft: “Google uses Bayesian filtering the way Microsoft uses the if statement” (Credit: Joel Spolsky). This series of posts describes some of the technologies that make this high level of abstraction possible.

Pragmatic Programming Techniques: NOSQL Patterns
http://horicky.blogspot.com/2009/11/nosql-patterns.html

A nice overview of some of the more popular patterns in NoSQL architecture

Hadoop - YDN
http://developer.yahoo.com/hadoop/

"Apache Hadoop* is an open source Java software framework for running data-intensive applications on large clusters of commodity hardware."

Hadoop and Distributed Computing at Yahoo!

Cookpad and Hadoop « Cookpad Developer Blog
http://techlife.cookpad.com/2009/09/16/cookpad-hadoop-introduction/

Easy-to-understand material.

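Hello. I'm Katsuma from the "Search" team, and I joined the company this May. There has been plenty of hard work since I joined, but recently the drink lovers among us started the "Katsuma-kai," a gathering where we go out drinking from Monday on, so my days are packed both at work and outside it! As a member of the "Search" team, I usually work on various development around Cookpad's search to improve satisfaction for users who "search" for recipes...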

Easy Map-Reduce With Hadoop Streaming - igvita.com
http://www.igvita.com/2009/06/01/easy-map-reduce-with-hadoop-streaming/

If you're considering doing large scale analysis of structured data (access logs, for example), there are dozens of enterprise-level solutions ranging from specialized streaming databases, to the more mundane data warehousing solutions with star topologies and column store semantics. Google, facing the same problem, developed a system called Sawzall, which leverages their existing Map-Reduce clusters for large scale parallel data analysis by adding a DSL for easy manipulation of data.

Map/Reduce Toolkit by NY Times engineers is a great example of a Ruby DSL on top of the Hadoop Streaming interface. Specifically aimed at simplifying their internal log processing jobs, it exposes just the necessary bits for handling the access log inputs and provides a number of predefined reduce steps: unique, counter, etc. For example, to get a list of all unique visitor IP's, the entire program consists of:

Google Technology RoundTable: Map Reduce
http://research.google.com/roundtable/MR.html

Matt is also the author of

octo.py: quick and easy MapReduce for Python
http://ebiquity.umbc.edu/blogger/2009/01/02/octopy-quick-and-easy-mapreduce-for-python/

showcases an example of using the mapreduce system octo.py

BrowserCouch Documentation
http://hg.toolness.com/browser-couch/raw-file/blog-post/index.html

BrowserCouch is an attempt at an in-browser MapReduce implementation.

"BrowserCouch is an attempt at an in-browser MapReduce implementation. It's written entirely in JavaScript and intended to work on all browsers, gracefully upgrading when support for better efficiency or feature set is detected. Not coincidentally, this library is intended to mimic the functionality of CouchDB on the client-side, and may even support integration with CouchDB in the future."

Hadoop Live CD at OpenSolaris.org
http://opensolaris.org/os/project/livehadoop/

OpenSolaris Project: Hadoop Live CD

A Comparison of Approaches to Large-Scale Data Analysis - MapReduce vs. DBMS Benchmarks
http://database.cs.brown.edu/sigmod09/

"The following information is meant to provide documentation on how others can recreate the benchmark trials used in our SIGMOD 2009 paper."

A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

Data-Intensive Text Processing with MapReduce
http://www.umiacs.umd.edu/~jimmylin/book.html