Reputation: 426
Currently I am reading some papers about Hadoop and the popular MapReduce algorithm. However, I cannot see the value of MapReduce, and I would be glad if someone could give some insight into it. Specifically:
It is said that MapReduce receives a file and produces key-value pairs. What is a key? Just a word, a combination of words, or something else? If the keys are the words in the file, then what is the purpose of writing code for MapReduce? MapReduce should do the same thing without implementing a specific algorithm.
If everything is converted to key-value pairs, then what Hadoop does is just create a Dictionary like in Java and C#, right? Maybe Hadoop can create the dictionary in a more efficient way. Other than efficiency, what does Hadoop provide that a normal Dictionary object cannot?
What do I gain by converting a file to key-value pairs? I know I can find the counts and frequencies of the words, but for what? What might be the purpose of counting the number of words?
It is said that Hadoop can be used for unstructured data. If everything is converted to key-value pairs, then it is only natural that Hadoop can work with unstructured data! I can write a program in C# to generate the key-value pairs instead of using Hadoop. What is the real value of Hadoop that I cannot obtain by using other kinds of programming tools?
The questions may seem correlated with each other, but I believe they convey the idea behind my question. I would be glad if you could answer them.
Regards,
Edit:
Hi Guys,
Thank you very much for your responses. Based on what I understood from your answers and from playing with Hadoop a little, I would like to state my conclusions in a very high-level, basic way:
Any comments on these outcomes are welcome.
As a final note, I would like to add that for a simple MapReduce implementation, I believe there should be a user interface that enables the user to select/define the keys and appropriate values. This UI could also be extended for further statistical analysis.
Regards,
Upvotes: 1
Views: 1127
Reputation: 38910
Take the word count example to get a better understanding.
What is a key? Just a word, a combination of words or something else?
For the Mapper: the key is the offset from the beginning of the file, and the value is the entire line. Once the line is read from the file, it is split into multiple key-value pairs for the Reducer. A delimiter (a tab, a space, or characters such as , or :) helps split the line into key-value pairs.

For the Reducer: the key is an individual word, and the value is the number of occurrences of that word.
Once you get the key-value pairs at the reducer, you can run many aggregation/summarization/categorization operations on the data and provide an analytical summary of it.
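To make the two roles concrete, here is a toy, single-process Python simulation of the mapper and reducer described above (real Hadoop jobs would be written against the Java MapReduce API, and the shuffle/grouping step shown here is done by the framework; the sample lines are made up):

```python
from collections import defaultdict

def mapper(offset, line):
    """Key in: byte offset of the line; value in: the line itself.
    Emits (word, 1) pairs for the reducer."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Key in: an individual word; values in: its occurrence counts.
    Emits (word, total occurrences)."""
    return (word, sum(counts))

# Simulate reading a file: (offset, line) pairs.
lines = [(0, "Deer Bear River"), (16, "Car Car River"), (30, "Deer Car Bear")]

# Map phase
mapped = [pair for offset, line in lines for pair in mapper(offset, line)]

# Shuffle: group values by key (handled by the Hadoop framework)
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce phase
result = dict(reducer(w, c) for w, c in sorted(groups.items()))
print(result)  # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```

The same key/value shapes carry over to the distributed case; only the execution model changes.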
Have a look at this use case article, which covers Financial, Energy, Telecom, Retail, etc.

Have a look at this article for a better understanding of the entire word count example, and at the MapReduce tutorial.
what is the purpose of writing code for MapReduce? MapReduce should do the same thing without implementing a specific algorithm.
Hadoop has four key components:
1. Hadoop Common: the common utilities that support the other Hadoop modules.
2. Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data.
3. Hadoop YARN: a framework for job scheduling and cluster resource management.
4. Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
Maybe Hadoop can create the dictionary in a more efficient way. Other than efficiency, what does Hadoop provide that a normal Dictionary object cannot?
Creating the dictionary is not the core purpose. Hadoop creates this dictionary and then uses the key-value pairs to solve business use cases, depending on the requirement.
The word count example may produce output that is just a word and its count, but you can process structured, semi-structured, and unstructured data for various use cases.
What might be the purpose of counting the number of words?
The purpose is explained in the earlier answers.
I can write a program in C# to generate the key-value pairs instead of using Hadoop. What is the real value of Hadoop that I cannot obtain by using other kinds of programming tools?
How much data volume can you handle by writing C# to get key-value pairs and process the data? Can you process 10 petabytes of weather information on a 5000-node cluster using C#, with a distributed storage/processing framework developed in C#?
How do you summarize the data, or find the top 10 coolest/hottest places, using C#?
You would have to develop a framework to do all of these things, and Hadoop already comes with that framework.
HDFS is used for distributed storage of data in volumes of petabytes. If you need to handle data growth, just add more nodes to the Hadoop cluster.

Hadoop MapReduce & YARN provide the framework for distributed data processing, processing data stored across thousands of machines in the Hadoop cluster.
Image source: kickstarthadoop (article author: Bejoy KS)
Upvotes: 3
Reputation: 3956
It is said that MapReduce receives a file and produces key-value pairs. What is a key? Just a word, a combination of words, or something else? If the keys are the words in the file, then what is the purpose of writing code for MapReduce? MapReduce should do the same thing without implementing a specific algorithm.
MapReduce should be visualized as a distributed computing framework. For the word count example the key is a word, but we can have anything as a key (APIs are available for common cases, and we can write custom ones as well). The purpose of having the key is to partition, sort, and merge the sorted data in order to perform aggregations. The map phase is used to perform row-level transformations, filtering, etc., and the reduce phase takes care of aggregation. Map and Reduce need to be implemented; the shuffle phase, which is typically available out of the box, takes care of partitioning, shuffling, sorting, and merging.
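The pipeline described above (map, then partition/sort/merge, then reduce) can be sketched in a few lines of Python. This is a minimal single-process sketch, not real Hadoop code: the hash-based routing loosely mirrors Hadoop's default hash partitioning, and the two-reducer count is an arbitrary assumption for illustration.

```python
# Sketch of the MapReduce phases: map -> shuffle (partition, sort,
# merge) -> reduce. In Hadoop the shuffle is provided by the
# framework; only the map and reduce functions are user code.
from itertools import groupby

NUM_REDUCERS = 2  # assumption: two reducer tasks

def map_fn(key, value):
    # Row-level transformation: emit (word, 1) for each word.
    return [(word, 1) for word in value.split()]

def partition(key):
    # Loosely mirrors Hadoop's default hash partitioning.
    return hash(key) % NUM_REDUCERS

def reduce_fn(key, values):
    # Aggregation over all values sharing a key.
    return (key, sum(values))

records = [(0, "to be or not to be")]

# Map phase
intermediate = [kv for off, line in records for kv in map_fn(off, line)]

# Shuffle: route each pair to a partition
partitions = [[] for _ in range(NUM_REDUCERS)]
for kv in intermediate:
    partitions[partition(kv[0])].append(kv)

# Reduce phase: each reducer sorts its partition, merges runs of
# equal keys, and aggregates
output = {}
for part in partitions:
    for key, group in groupby(sorted(part), key=lambda kv: kv[0]):
        k, v = reduce_fn(key, [n for _, n in group])
        output[k] = v
print(sorted(output.items()))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

Swapping `map_fn` and `reduce_fn` is all it takes to express filtering or other aggregations; the shuffle machinery stays the same, which is exactly why the framework provides it.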
If everything is converted to key-value pairs, then what Hadoop does is just create a Dictionary like in Java and C#, right? Maybe Hadoop can create the dictionary in a more efficient way. Other than efficiency, what does Hadoop provide that a normal Dictionary object cannot?
Covered as part of the previous question.
What do I gain by converting a file to key-value pairs? I know I can find the counts and frequencies of the words, but for what? What might be the purpose of counting the number of words?
You can perform transformations, filtering, aggregations, joins, and any custom task on unstructured data. The major difference is that the processing is distributed, so it can scale better than legacy solutions.
It is said that Hadoop can be used for unstructured data. If everything is converted to key-value pairs, then it is only natural that Hadoop can work with unstructured data! I can write a program in C# to generate the key-value pairs instead of using Hadoop. What is the real value of Hadoop that I cannot obtain by using other kinds of programming tools?
The key can be the line offset, and then you can process each record. It does not matter whether every record has the same structure or a different one.
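A quick Python sketch of that idea, using made-up log lines of mixed formats (the city/temperature records and formats here are hypothetical, chosen only to show a mapper keyed by offset coping with heterogeneous records):

```python
from collections import defaultdict

def mapper(offset, line):
    """Receives (offset, line); each line may have a different
    structure. Unparseable lines are simply skipped."""
    fields = line.split(",")
    if len(fields) == 3:          # e.g. "city,date,temp"
        city, _date, temp = fields
        yield (city, float(temp))
    elif len(fields) == 2:        # e.g. "city,temp" (older format)
        city, temp = fields
        yield (city, float(temp))
    # anything else: unstructured noise, emit nothing

records = [
    (0, "Paris,2014-06-01,21.5"),
    (23, "Paris,19.0"),
    (34, "### corrupt line ###"),
    (55, "Oslo,2014-06-01,12.0"),
]

# Group mapper output by key (the framework's job in Hadoop)
by_city = defaultdict(list)
for off, line in records:
    for city, temp in mapper(off, line):
        by_city[city].append(temp)

# Reduce: e.g. maximum temperature per city
hottest = {city: max(temps) for city, temps in by_city.items()}
print(hottest)  # {'Paris': 21.5, 'Oslo': 12.0}
```

Because the mapper owns all parsing, mixed or partially corrupt input costs nothing extra; only the records the mapper chooses to emit ever reach the reducer.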
Here are the advantages of using Hadoop:
Upvotes: 3