Pages tagged hadoop:

Amazon Elastic MapReduce
http://aws.amazon.com/elasticmapreduce/

There's a growing trend to provide some pretty awesome IT services over the internet. Seems to me that's the way it will mostly be in a decade's time - or less.
Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
Who needs infrastructure? Keep your data somewhere else. Process your data somewhere else. You can now run your small data business out of your garage. Just photoshop a nice office for the investors.
Cloudera's Basic Hadoop Training | Cloudera
http://www.cloudera.com/hadoop-training-basic
Cloudera's Basic Hadoop Training is available online, free of charge. If you have questions about the content, please feel free to direct them to community support. Note: The activities and tutorials suggest downloading our virtual machine (VM). They all use the same VM, so if you download it once, there is no need to do so again.
Trying out Amazon Elastic MapReduce - moratorium
http://kzk9.net/blog/2009/04/reviewing_amazon_elastic_map_reduce.html
mrtoolkit - Google Code
http://code.google.com/p/mrtoolkit/
Map/Reduce Jobs for Hadoop in Ruby
MRToolkit provides a framework for building simple Map/Reduce jobs in just a few lines of code. You provide only the map and reduce logic, the framework does the rest. Or use one of the provided map or reduce tools, and write even less.
Wrapper around Hadoop's Map/Reduce for easier writing of jobs.
A Hadoop wrapper for Ruby
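MRToolkit itself is Ruby, but the pattern it wraps (you supply only the map and reduce logic, the framework handles the grouping) is easy to sketch in a few lines of Python. Names below are illustrative, not MRToolkit's API:

```python
from collections import defaultdict

def run_job(records, mapfn, reducefn):
    """Toy single-process MapReduce driver: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapfn(record):   # map phase
            groups[key].append(value)      # shuffle: group values by key
    return {key: reducefn(key, values) for key, values in groups.items()}

# Word count: the canonical "few lines of map and reduce logic".
def wc_map(line):
    for word in line.split():
        yield word, 1

def wc_reduce(word, counts):
    return sum(counts)

print(run_job(["to be or not to be"], wc_map, wc_reduce))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The framework owns everything except `wc_map` and `wc_reduce`, which is the whole pitch.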
Project Voldemort Blog : Building a terabyte-scale data cycle at LinkedIn with Hadoop and Project Voldemort
http://project-voldemort.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/
Not one of those "we're using hadoop, now we're cool" articles. Well written!
Hadoop
bashreduce: A Bare-Bones MapReduce | Linux Magazine
http://www.linux-mag.com/cache/7407/1.html
heh. maybe useful for learning the mapreduce paradigm?
Facebook, Hadoop, and Hive | DBMS2 -- DataBase Management System Services
http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/
Just wanted to add that even though there is a single point of failure, reliability due to software bugs has not been an issue and the DFS NameNode has been very stable. The JobTracker crashes that we have seen are due to errant jobs: job isolation is not yet that great in Hadoop, and a bad query from a user can bring down the tracker (though the recovery time for the tracker is literally a few minutes). There is some good work happening in the community, though, to address those issues.
A few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of Cloudera, in which he discussed a Hadoop-based effort at Facebook he previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon, and in a couple of instances correct, what Jeff had said. They also filled me in on Hive, a data-manipulation add-on to Hadoop that they developed and subsequently open-sourced.
HadoopDB Project
http://db.cs.yale.edu/hadoopdb/hadoopdb.html
An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
HadoopDB is:
1. A hybrid of DBMS and MapReduce technologies that targets analytical workloads
2. Designed to run on a shared-nothing cluster of commodity machines, or in the cloud
3. An attempt to fill the gap in the market for a free and open source parallel DBMS
4. Much more scalable than currently available parallel database systems and DBMS/MapReduce hybrid systems
5. As scalable as Hadoop, while achieving superior performance on structured data analysis workloads
DBMS Musings: Announcing release of HadoopDB (longer version)
http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html
my students Azza Abouzeid and Kamil Bajda-Pawlikowski developed HadoopDB. It's an open source stack that includes PostgreSQL, Hadoop, and Hive, along with some glue between PostgreSQL and Hadoop, a catalog, a data loader, and an interface that accepts queries in MapReduce or SQL and generates query plans that are processed partly in Hadoop and partly in different PostgreSQL instances spread across many nodes in a shared-nothing cluster of machines. In essence it is a hybrid of MapReduce and parallel DBMS technologies. But unlike Aster Data, Greenplum, Pig, and Hive, it is not a hybrid simply at the language/interface level. It is a hybrid at a deeper, systems implementation level. Also unlike Aster Data and Greenplum, it is free and open source.
Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2 | Cloudera
http://www.cloudera.com/hadoop-data-intensive-application-tutorial
This tutorial will show you how to use Amazon EC2 and Cloudera's Distribution for Hadoop to run batch jobs for a data intensive web application. During the tutorial, we will perform the following data processing steps:
* Configure and launch a Hadoop cluster on Amazon EC2 using the Cloudera tools
* Load Wikipedia log data into Hadoop from Amazon Elastic Block Store (EBS) snapshots and Amazon S3
* Run simple Pig and Hive commands on the log data
* Write a MapReduce job to clean the raw data and aggregate it to a daily level (page_title, date, count)
* Write a Hive query that finds trending Wikipedia articles by calling a custom mapper script
* Join the trend data in Hive with a table of Wikipedia page IDs
* Export the trend query results to S3 as a tab delimited text file for use in our web application's MySQL database
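The aggregation step in that list maps directly onto Hadoop Streaming: mapper and reducer are tiny scripts that read stdin and write tab-delimited stdout, with a sort in between. A Python sketch, with an assumed "date page_title hits" input layout (the real Wikipedia log format may differ):

```python
from itertools import groupby

def mapper(lines):
    """Emit 'page_title<TAB>date<TAB>count' per raw log line.
    The 'date page_title hits' field order is an assumption for
    illustration, not the tutorial's actual log layout."""
    for line in lines:
        date, title, hits = line.split()[:3]
        yield f"{title}\t{date}\t{hits}"

def reducer(sorted_lines):
    """Sum counts per (page_title, date). Hadoop Streaming sorts the
    mapper output between phases, which is what groupby relies on."""
    keyed = (line.rsplit("\t", 1) for line in sorted_lines)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(n) for _, n in group)}"

# Locally, 'cat logs | mapper | sort | reducer' collapses to:
mapped = sorted(mapper(["20090601 Main_Page 3", "20090601 Main_Page 2"]))
print(list(reducer(mapped)))  # ['Main_Page\t20090601\t5']
```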
Last.fm – the Blog · Mapreduce Bash Script
http://blog.last.fm/2009/04/06/mapreduce-bash-script
One night at the pub we discussed whether one could replace Hadoop (a massive and comprehensive implementation of Mapreduce) with a single bash script, an awk command, sort, and a sprinkling of netcat. This turned into a weekend project dubbed bashreduce.
Hardcoded version of push
Map-Reduce implemented as a bash script!
MapReduce in a Bash Script
Map Reduce implemented in bash using sort, awk, grep, join.
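What makes `sort` plus `awk` a workable reducer is that sorting brings equal keys together; bashreduce then spreads the work across machines by partitioning mapper output by key before shipping it through netcat. A Python sketch of that partitioning idea (the byte-sum hash is my illustration, not bashreduce's actual code):

```python
def partition(lines, n_workers):
    """Route each line to a worker by hashing its key (first field),
    so all lines with the same key land on the same reducer -- the
    same job Hadoop's partitioner (or bashreduce's netcat fan-out) does."""
    buckets = [[] for _ in range(n_workers)]
    for line in lines:
        key = line.split("\t", 1)[0]
        # sum of bytes: a deliberately simple, deterministic hash
        buckets[sum(key.encode()) % n_workers].append(line)
    return buckets

lines = ["a\t1", "b\t1", "a\t2"]
for i, bucket in enumerate(partition(lines, 2)):
    print(i, bucket)  # equal keys always share a bucket
```

Each bucket can then be sorted and reduced independently, which is the whole distribution story.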
MapReduce programming with Apache Hadoop - Java World
http://www.javaworld.com/javaworld/jw-09-2008/jw-09-hadoop.html
hadoop
Google and its MapReduce framework may rule the roost when it comes to massive-scale data processing, but there's still plenty of that goodness to go around. This article gets you started with Hadoop, the open source MapReduce implementation for processing large data sets. Authors Ravi Shankar and Govindu Narendra first demonstrate the powerful combination of map and reduce in a simple Java program, then walk you through a more complex data-processing application based on Hadoop. Finally, they show you how to install and deploy your application in both standalone mode and clustering mode.
Trending Topics: Hot Wikipedia Topics - Powered by Hadoop & EC2
http://www.trendingtopics.org/
A search engine for trending topics, built by Data Wrangling on Cloudera's Hadoop distribution; it shows off some serious large-scale data processing abilities.
awesome website that mines wikipedia traffic levels
Easy distributed processing with Hadoop (Yahoo! JAPAN Tech Blog)
http://techblog.yahoo.co.jp/cat207/cat209/hadoop/
InfoQ: Clojure and Rails - the Secret Sauce Behind FlightCaster
http://www.infoq.com/articles/flightcaster-clojure-rails
Clojure is a LISP for the JVM created by Rich Hickey.
FlightCaster, a realtime flight delay site, is built on Clojure and Hadoop for the statistical analysis. The web frontend is built with Ruby on Rails and hosted on Heroku. We talked to Bradford Cross about Clojure, functional programming and tips for OOP developers interested in making the jump.
Another critical piece of infrastructure is Cascading, an excellent layer on top of Hadoop that adds additional abstraction and functionality. We definitely recommend Cascading to anyone doing serious data processing and mining with Hadoop.
The Anatomy of Hadoop I/O Pipeline (Hadoop and Distributed Computing at Yahoo!)
http://developer.yahoo.net/blogs/hadoop/2009/08/the_anatomy_of_hadoop_io_pipel.html
Facebook | Engineering @ Facebook's Notes
http://www.facebook.com/note.php?note_id=89508453919
Hadoop - YDN
http://developer.yahoo.com/hadoop/
"Apache Hadoop* is an open source Java software framework for running data-intensive applications on large clusters of commodity hardware."
Hadoop and Distributed Computing at Yahoo!
Cookpad and Hadoop « Cookpad Developer Blog
http://techlife.cookpad.com/2009/09/16/cookpad-hadoop-introduction/
An easy-to-follow write-up.
Nice to meet you. I'm Katsuma from the Search team; I joined the company this May. There has been plenty to keep me busy since joining, but recently a "Katsuma-kai" was even founded, where the drink lovers among us get together starting on Mondays, so my days are packed both at work and outside it! Anyway, since I belong to the Search team, I usually do all sorts of development around Cookpad's search to improve satisfaction for users searching for recipes...
Easy Map-Reduce With Hadoop Streaming - igvita.com
http://www.igvita.com/2009/06/01/easy-map-reduce-with-hadoop-streaming/
If you're considering doing large scale analysis of structured data (access logs, for example), there are dozens of enterprise-level solutions ranging from specialized streaming databases, to the more mundane data warehousing solutions with star topologies and column store semantics. Google, facing the same problem, developed a system called Sawzall, which leverages their existing Map-Reduce clusters for large scale parallel data analysis by adding a DSL for easy manipulation of data.
Map/Reduce Toolkit by NY Times engineers is a great example of a Ruby DSL on top of the Hadoop Streaming interface. Specifically aimed at simplifying their internal log processing jobs, it exposes just the necessary bits for handling the access log inputs and provides a number of predefined reduce steps: unique, counter, etc. For example, to get a list of all unique visitor IP's, the entire program consists of only a few lines.
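The MRToolkit snippet itself is a Ruby DSL program; here's a rough Python analogue of the same job (pull out the client IP, then apply a canned "unique" reduce), assuming Apache-style access logs with the IP as the first field:

```python
def mapper(log_lines):
    """Emit the client IP, assumed to be the first whitespace-separated
    field, as in Apache common log format."""
    for line in log_lines:
        yield line.split(None, 1)[0]

def unique_reduce(keys):
    """A 'unique' canned reduce: first occurrence of each key wins.
    In a real streaming job the input arrives already sorted, so this
    is a single pass with no set held in memory."""
    last = object()
    for key in sorted(keys):
        if key != last:
            yield key
        last = key

logs = ['1.2.3.4 - - "GET /"', '5.6.7.8 - - "GET /"', '1.2.3.4 - - "GET /x"']
print(list(unique_reduce(mapper(logs))))  # ['1.2.3.4', '5.6.7.8']
```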
Google Technology RoundTable: Map Reduce
http://research.google.com/roundtable/MR.html
octo.py: quick and easy MapReduce for Python
http://ebiquity.umbc.edu/blogger/2009/01/02/octopy-quick-and-easy-mapreduce-for-python/
showcases an example of using the mapreduce system octo.py
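From memory, an octo.py job is just a Python module defining source data plus map and reduce functions; the sketch below follows that shape (treat the names `source`/`mapfn`/`reducefn` as assumptions and check octo.py's docs) and uses a tiny local driver in place of octo.py's networked server/client machinery. This one builds an inverted index rather than a word count:

```python
# A module shaped like an octo.py job (names assumed, not verified):
source = {"doc1": "hello world", "doc2": "hello octo"}

def mapfn(key, value):
    # Emit (word, document id) for every word in the document.
    for word in value.split():
        yield word, key

def reducefn(word, doc_ids):
    # Collapse to a sorted list of distinct documents per word.
    return sorted(set(doc_ids))

def run_locally(source, mapfn, reducefn):
    """Stand-in for octo.py's server/client: runs the same functions
    in-process so the job logic can be tested without the library."""
    shuffled = {}
    for k, v in source.items():
        for mk, mv in mapfn(k, v):
            shuffled.setdefault(mk, []).append(mv)
    return {k: reducefn(k, vs) for k, vs in shuffled.items()}

print(run_locally(source, mapfn, reducefn))
# {'hello': ['doc1', 'doc2'], 'world': ['doc1'], 'octo': ['doc2']}
```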
Twitter / OpenSource
http://twitter.com/about/opensource
Twitter is built on open-source software—here are the projects we have released or contribute to. Also see our engineering blog for more details. Want to work on stuff like this? Check out our jobs.
This is what #Twitter is built on! http://bit.ly/aySbTA #Opensource
Trends in advanced programming
Hadoop Live CD at OpenSolaris.org
http://opensolaris.org/os/project/livehadoop/
OpenSolaris Project: Hadoop Live CD
A Comparison of Approaches to Large-Scale Data Analysis - MapReduce vs. DBMS Benchmarks
http://database.cs.brown.edu/sigmod09/
"The following information is meant to provide documentation on how others can recreate the benchmark trials used in our SIGMOD 2009 paper."
HBase vs Cassandra: why we moved « Bits and Bytes.
http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/
Data-Intensive Text Processing with MapReduce
http://www.umiacs.umd.edu/~jimmylin/book.html
NoSQL at Twitter (NoSQL EU 2010)
http://www.slideshare.net/kevinweil/nosql-at-twitter-nosql-eu-2010
Twitter's NoSQL slides
A discussion of the different NoSQL-style datastores in use at Twitter, including Hadoop (with Pig for analysis), HBase, Cassandra, and FlockDB.
cassandra, thrift, hdfs, hbase, scribe, pig, lzo, flockdb
interesting presentation on #NoSQL at #twitter by @kevinweil http://bit.ly/99h8BK [from http://twitter.com/behi_at/statuses/13587582774]