Radio Free Agile

March 24, 2008

SPA2008: Tom White on Understanding MapReduce with Hadoop

Filed under: Hadoop, SPA2008 — Tamara Petroff @ 12:12 pm

Tom White spoke on the Google-instigated MapReduce Tool. Its purpose is to process large amounts of data efficiently, to do things such as searching and sorting that would take days, months or years if traditional methods were used. Organisations now have more and more data, to the point that needing to process a terabyte of data isn’t so unusual.

A solution is to use parallelism – put chunks of data on different machines and run (for instance) grep independently on each, then aggregate the results.

Another aspect of this comes from physical limitations of hard disk drive access. If you can do processing at the speed of data transfer (as opposed to the speed of disk access) then you will have an advantage. This leads to the concept of streaming DBs and in-memory DBs.

This is the strategy used by MapReduce – it does merges and sorts at the transfer rate.

It is batch-oriented – for doing analysis offline, not at web speed.

It is designed to work on unstructured data and doesn’t assume a schema.

The timeline of MapReduce:

Feb 2003 – First use at Google

Dec 2004 – Google’s paper published

July 2005 – Nutch adopts MapReduce, led by Doug Cutting

Jan 2006 – Doug Cutting joins Yahoo!

Feb 2006 – Hadoop moves into the Lucene project

Apr 2007 – Yahoo! runs Hadoop on 1000 node cluster

Jan 2008 – Hadoop is made an Apache top-level project

Feb 2008 – Yahoo! generate production web search index using Hadoop

Hadoop (the open source project) is MapReduce plus a bit:

* The Hadoop distributed file system

* Pig – A high-level language for data analysis

* HBase – storage for semi-structured data – input or output

* MapReduce

The general form of a MapReduce operation consists of two steps – a map step, and a reduce step. The details of what happens at each step is configurable by code.

Tom introduced the following notation to represent what happens.

Map: (K1, V1) -> list(K2, v2)

where K1, K2 are keys

and V1, V2 are values

The output is of the same type as the input.

Reduce: (K2, list(V2)) -> list(K3, V3)

As an example, here is an implementation of grep (a Unix function that finds lines matching a given regular expression)

Map: (offset, regex) -> [(match, 1)]

So each line matching the regex is output as a key with a constant value of 1.

Reduce: (match, [1,1,1,...]) -> [(match, n)]

After the reduce step, you get a list of matching lines together with a count of each line.

The input HDFS is split into chunks across machines. It is important not to have to move chunks around across the network – this is what you are trying to avoid. The map runs on the same physical box as the file chunk. Some optimisation occurs that does part of the reduce step before moving files, to minimise data shoved across the network.

There is a single job tracker that manages the parceling of tasks to nodes in the network. Each node has a task tracker to manage its tasks.

Of interest is the sort function – both the map and reduce steps are identity functions; the MapReduce framework does the sorting inherently.

The name “Hadoop” comes from the name of the principle developer’s son’s plush elephant, and does not stand for “high availability distributed object-oriented programming” as was my guess when I first heard it!  Just shows how wrong we can sometimes be…

* * *

No Comments »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a comment