tempx

Reputation: 426

Purpose of Hadoop MapReduce

I am currently reading some papers about Hadoop and the popular MapReduce algorithm. However, I cannot see the value of MapReduce and would be glad if someone could give some insight about it. Specifically:

  1. It is said that MapReduce receives a file and produces key-value pairs. What is a key? Just a word, a combination of words, or something else? If the key is the words in the file, then what is the purpose of writing code for MapReduce? MapReduce should do the same thing without implementing a specific algorithm.
  2. If everything is converted to key-value pairs, then what Hadoop does is just create a Dictionary like in JAVA and C#, right? Maybe Hadoop can create the dictionary in a more efficient way. Other than efficiency, what does Hadoop provide that a normal Dictionary object cannot?
  3. What do I earn by converting a file to key-value pairs? I know I can find the counts and frequencies of the words, but for what? What may be the purpose of counting the number of words?
  4. It is said that Hadoop can be used for unstructured data. If everything is converted to a key-value pair, then it is only natural that Hadoop can work with unstructured data! I can write a program in C# to generate the key-value pairs instead of using Hadoop. What is the real value of Hadoop that I cannot utilize by using other kinds of programming tools?

The questions may seem correlated with each other, but I believe they convey the idea behind my question. I will be glad if you can answer them.

Regards,

Edit:

Hi Guys,

Thank you very much for your responses. Based on what I understood from your answers and from playing with Hadoop a little, I would like to state my conclusions in a very high-level, basic way:

Any comments on these outcomes are welcome.

As a final note, I would like to add that for a simple MapReduce implementation, I believe there should be a user interface that enables the user to select/define the keys and appropriate values. This UI could also be extended for further statistical analysis.

Regards,

Upvotes: 1

Views: 1127

Answers (2)

Ravindra babu

Reputation: 38910

Take the word count example to get a better understanding.

What is a key? Just a word, a combination of words, or something else?

For the Mapper:

The key is the byte offset from the beginning of the file, and the value is the entire line. Once a line is read from the file, it is split into multiple key-value pairs for the Reducer. A delimiter such as a tab or space, or characters like , and :, helps split the line into key-value pairs.

For the Reducer:

The key is an individual word, and the value is the number of occurrences of that word.

Once you get the key-value pairs at the reducer, you can run many aggregation/summarization/categorization operations on the data and provide an analytical summary of it.
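
To make this concrete, here is a minimal sketch of the classic word count Mapper and Reducer against the Hadoop MapReduce Java API, closely following the standard WordCount example shipped with Hadoop (it splits on whitespace rather than a custom delimiter):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

      // Mapper: input key = byte offset of the line, input value = the line.
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          // Split the line on whitespace and emit (word, 1) per token.
          StringTokenizer tokens = new StringTokenizer(line.toString());
          while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: key = a word, values = all the 1s emitted for that word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable count : counts) {
            sum += count.get();
          }
          result.set(sum);
          context.write(word, result); // (word, total occurrences)
        }
      }
    }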

Have a look at this use-case article, which covers Financial, Energy, Telecom, Retail, etc.

Have a look at this article for a better understanding of the entire word count example, and at the MapReduce tutorial.

what is the purpose of writing code for MapReduce? MapReduce should do the same thing without implementing a specific algorithm.

Hadoop has four key components.

1. Hadoop Common: The common utilities that support the other Hadoop modules.

2. Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

3. Hadoop YARN: A framework for job scheduling and cluster resource management.

4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
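
The last component is where your Mapper and Reducer plug in. As a sketch, a minimal driver for the word count classes above would configure and submit the job like this (input and output are HDFS paths passed on the command line):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

YARN schedules the map and reduce tasks of this job across the cluster; the same reducer class can double as a combiner here because word counting is associative.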

Maybe Hadoop can create the dictionary in a more efficient way. Other than efficiency, what does Hadoop provide that a normal Dictionary object cannot?

Creating the dictionary is not the core purpose. Hadoop creates this dictionary and then uses the key-value pairs to solve business use cases, depending on the requirement.

The word count example may produce output that is just words and their counts, but you can process structured, semi-structured, and unstructured data for various use cases, for example:

  1. Find the hottest day of the year/month, or the hottest hour of the day, for a given place anywhere in the world.
  2. Find the number of buy/sell transactions for a particular stock on the NYSE on a given day. Provide a minute-wise/hour-wise/day-wise summary of transactions per stock. Find the top 10 most heavily traded stocks on a given day (see the mapper sketch after this list).
  3. Find the number of tweets/retweets for a particular hashtag.
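
To illustrate use case 2, a mapper can emit a composite key per trade; the reducer is then the same counting reducer as in word count. The CSV layout assumed below (symbol, date, time, price, volume) is hypothetical, not a real NYSE feed format:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Assumed record layout: symbol,date,time,price,volume (one trade per line).
    public class TradeCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text stockAndHour = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        if (fields.length < 3) {
          return; // skip malformed records
        }
        String symbol = fields[0];
        String hour = fields[2].split(":")[0]; // "hh" from "hh:mm:ss"
        // A composite key "symbol,hour" yields an hour-wise
        // transaction count per stock at the reducer.
        stockAndHour.set(symbol + "," + hour);
        context.write(stockAndHour, ONE);
      }
    }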

What may be the purpose of counting the number of words?

The purpose is explained in the earlier answers above.

I can write a program in C# to generate the key-value pairs instead of using Hadoop. What is the real value of Hadoop that I cannot utilize by using other kinds of programming tools?

How much data volume can you handle by writing C# to get key-value pairs and process the data? Can you process 10 petabytes of weather information on a 5,000-node cluster using C#, with a distributed storage/processing framework developed in C#?

How do you summarize the data, or find the top 10 cool/hot places, using C#?

You would have to develop a framework to do all of these things, and Hadoop has already come up with that framework.

  1. HDFS is used for distributed storage of data in volumes of petabytes. If you need to handle data growth, just add more nodes to the Hadoop cluster (a small FileSystem API sketch follows this list).

  2. Hadoop MapReduce and YARN provide a framework for distributed data processing, so data stored across thousands of machines in a Hadoop cluster can be processed in parallel.
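
For instance, getting a local file into HDFS is a couple of lines against the FileSystem API. A minimal sketch, assuming a configured cluster; the paths are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsUpload {
      public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Copy a local file into the distributed store; HDFS splits it
        // into blocks and replicates them across the cluster's nodes.
        fs.copyFromLocalFile(new Path("file:///tmp/weather.csv"),
                             new Path("/data/weather/weather.csv"));
        fs.close();
      }
    }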

Image source: kickstarthadoop (article author: Bejoy KS)

Upvotes: 3

Durga Viswanath Gadiraju

Reputation: 3956

It is said that MapReduce receives a file and produces key-value pairs. What is a key? Just a word, a combination of words, or something else? If the key is the words in the file, then what is the purpose of writing code for MapReduce? MapReduce should do the same thing without implementing a specific algorithm.

MapReduce should be visualized as a distributed computing framework. For the word count example the key is a word, but the key can be anything (APIs are available for common types, and we can write custom ones as well). The purpose of having a key is to partition, sort, and merge the sorted data in order to perform aggregations. The map phase is used to perform row-level transformations, filtering, etc., and the reduce phase takes care of aggregation. Map and Reduce need to be implemented; the shuffle phase, which typically works out of the box, takes care of partitioning, shuffling, sorting, and merging.
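
As an illustration of the custom hooks mentioned above, here is a minimal custom Partitioner. It is a hypothetical example (not from the original answer) that routes all words starting with the same letter to the same reducer, and would be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        if (word.isEmpty()) {
          return 0;
        }
        // A char promotes to a non-negative int, so the modulo is safe.
        return Character.toLowerCase(word.charAt(0)) % numPartitions;
      }
    }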

If everything is converted to key-value pairs, then what Hadoop does is just create a Dictionary like in JAVA and C#, right? Maybe Hadoop can create the dictionary in a more efficient way. Other than efficiency, what does Hadoop provide that a normal Dictionary object cannot?

Covered as part of the previous question.

What do I earn by converting a file to key-value pairs? I know I can find the counts and frequencies of the words, but for what? What may be the purpose of counting the number of words?

You can perform transformations, filtering, aggregations, joins, and any custom task that can be performed on unstructured data. The major difference is that the processing is distributed; hence it can scale better than legacy solutions.

It is said that Hadoop can be used for unstructured data. If everything is converted to a key-value pair, then it is only natural that Hadoop can work with unstructured data! I can write a program in C# to generate the key-value pairs instead of using Hadoop. What is the real value of Hadoop that I cannot utilize by using other kinds of programming tools?

The key can be the line offset, and then you can process each record. It does not matter whether every record has the same structure or a different one (a small sketch follows).
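
A small hypothetical sketch of that point: because the input key is just a byte offset, the mapper is free to decide per record how to interpret the raw line, even when records do not share a schema. Here it simply counts records per detected format:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MixedFormatMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text format = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        String raw = line.toString().trim();
        if (raw.isEmpty()) {
          return;
        }
        // Branch on whatever shape shows up; no fixed schema required.
        if (raw.startsWith("{")) {
          format.set("json_record");
        } else if (raw.contains(",")) {
          format.set("csv_record");
        } else {
          format.set("plain_text");
        }
        context.write(format, ONE); // count records per detected format
      }
    }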

Here are the advantages of using Hadoop:

  1. Distributed file system (HDFS)
  2. Distributed processing framework (MapReduce)
  3. Data locality (in typical modern applications, files are network-mounted, so the data, which is bigger than the code, has to be copied to the servers where the code is deployed; in Hadoop, the code goes to the data, and Hadoop's success stories do not rely on a network file system)
  4. Limited use of the network while storing and processing very large data sets
  5. Cost-effectiveness (open-source software on commodity hardware), and many more.

Upvotes: 3
