Reputation: 3444
I recently stumbled upon Google's MapReduce.
I have read the description/docs twice, and I still cannot understand what exactly it is or when to use it.
Thank you very much.
Upvotes: 4
Views: 553
Reputation: 16212
This question has been answered well, but I figured I'd add something. It seems to me that the crux of the question is that map-reduce itself is not understood. Google MapReduce is just one implementation; there's also Hadoop and all sorts of other things. Here's a rundown of the hello world of map-reduce:
Say you have a book and you want to calculate the word count for each word. Here's one way of doing it:
word_dict = {}
for line in book_file_handler:  # book_file_handler is an open file object for the book
    for word in line.split():
        word_dict[word] = word_dict.get(word, 0) + 1
This is a bit of an oversimplification because it ignores punctuation, but whatevs.
So this code works. What if you want to make it run really fast by making use of your shiny cluster? It would be great to send a section of the book to each computer taking part in the calculation, have each of them count some words, and then combine the results. This is possible because each line in the book is independent of every other line. And that's the kind of thing map-reduce is for:
If you have an algorithm that needs to perform the same operation over many independent objects, such that the result for one object does not depend on the result for any of the others, then map-reduce is appropriate.
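To make that concrete, here is a minimal sketch (plain Python, no framework, with made-up chunks standing in for the book sections you'd ship to each machine) of how the same word count splits into a map step and a reduce step:

from collections import Counter
from functools import reduce

def map_chunk(lines):
    # "Map" step: count words in one independent chunk of the book.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(total, partial):
    # "Reduce" step: merge one partial count into the running total.
    total.update(partial)
    return total

# In a real cluster these chunks would be sections of the book sent to
# different machines; here they are just made-up lists of lines.
chunks = [["the quick brown fox"], ["the lazy dog", "the end"]]
partial_counts = [map_chunk(chunk) for chunk in chunks]        # map
word_dict = reduce(reduce_counts, partial_counts, Counter())   # reduce
print(word_dict)

Because each chunk is counted independently, the map calls can run on different machines; only the cheap merge at the end needs to see all the partial results.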
Upvotes: 0
Reputation: 9458
When you wish to have data parallelism.
A map-reduce framework should be used when you have some heavy piece of computation that needs more than a single CPU. In map-reduce, the task is first divided into independent chunks. Those chunks are then computed separately. Once all the chunks have been computed, the results are combined to give the final output. One common example is machine learning: many of the calculations for a coefficient vector can be performed separately and the results then combined. In short, only consider using map-reduce if you have more than a single CPU; otherwise it doesn't make sense.
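As a rough sketch of that split-compute-combine pattern (using Python's multiprocessing on a single machine, with a made-up sum-of-squares standing in for the heavy computation):

from multiprocessing import Pool

def partial_sum_of_squares(chunk):
    # The heavy computation, run independently on each chunk (the "map" step).
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]   # divide the task into independent chunks
    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum_of_squares, chunks)  # compute chunks separately
    print(sum(partials))                      # combine the results (the "reduce" step)

With one CPU this buys you nothing over a plain loop; the pattern only pays off when the chunks really do run on separate cores or machines.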
Upvotes: 4
Reputation: 11721
Google App Engine provides you with an API (Java and Python) for running MapReduce jobs on their App Engine. Although you cannot view all of the source code (modules like the scheduler, job tracker, task tracker, etc.), you can view the source code for the API (which includes mappers, reducers, the partitioner, etc.). GAE also provides you with a Software Development Kit (SDK) on which you can test your application. Once you're satisfied with your app's performance, you can upload it to GAE and anyone can access it.
I have made one such app; it's at shaileshmapreduce.appspot.com. It won't let you run a MapReduce job, because I'd have to add your Gmail ID to the user list, but you can check out the interface and everything.
You can also try out their MapReduce demo: https://developers.google.com/appengine/docs/python/dataprocessing/helloworld
Of course, you need to make sure that you have the SDK and the required MapReduce library installed on your machine.
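To give a feel for the Python API, here is a minimal mapper sketch in the style of the appengine-mapreduce library's own demos; the names are recalled from its documentation, so treat this as an approximation rather than the exact current API:

from mapreduce import operation as op

def lower_case_content(entity):
    # Mapper: called once per datastore entity; yields operations for the framework to apply.
    entity.content = entity.content.lower()
    yield op.db.Put(entity)

The framework handles splitting the input, scheduling the mapper calls, and writing the results; you only supply the per-entity logic.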
Upvotes: 4
Reputation: 308763
Here's a great explanation of MapReduce:
http://www.joelonsoftware.com/items/2006/08/01.html
Upvotes: 4
Reputation: 500397
Allow me to quote Wikipedia:
MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster or a grid. Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured).
Upvotes: 3