user4444053

Reputation:

Is the (key,value) pair in Hadoop always ('text',1)?

I am new to Hadoop.

Can you please explain the (key/value) pair? Is the value always one? Is the output of the reduce step always a (key/value) pair? If so, how is that (key/value) data used further?

Please help me.

Upvotes: 2

Views: 3467

Answers (2)

frb

Reputation: 3798

I guess you are asking about the value 'one' in the (key,value) pair because of the wordcount example in the Hadoop tutorials. The answer is no, it is not always 'one'.

Hadoop's implementation of MapReduce works by passing (key,value) pairs through the entire workflow, from the input to the output:

  • Map step: Generally speaking (there are special cases, depending on the input format), the mappers process the data in their assigned splits line by line. Each line is passed to the map method as a (key,value) pair: with the default text input format, the key is the byte offset of the line within the file and the value is the line itself. The mapper then emits new (key,value) pairs whose meaning depends on the mapping function you implement. Sometimes that is a variable key and a fixed value (e.g. in wordcount, the key is the word and the value is always 'one'); other times the value is the length of the line, or the sum of all the words starting with a prefix, or whatever you can imagine. The key may be a word, a fixed custom key, and so on.

  • Reduce step: Typically the reducer receives, for each key, the list of values that the mappers produced for that key (what exactly arrives also depends on the combiner class you are using, but this is generally speaking). It then emits further (key,value) pairs at the output; again, their content depends on the logic of your application. Typically, the reducer is used to aggregate all the values for the same key.
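The two steps above can be imitated in a few lines of plain Python. This is only an in-memory sketch, not the Hadoop API: `run_job`, `wordcount_map`, `line_length_map` and `sum_reduce` are hypothetical names chosen for illustration, and the grouping dictionary stands in for the shuffle. It runs the same reducer with two different mappers to show that the value is 'one' only because the wordcount mapper chooses to emit it.

```python
from collections import defaultdict


def run_job(lines, mapper, reducer):
    """Tiny in-memory imitation of map -> group by key -> reduce."""
    grouped = defaultdict(list)
    offset = 0
    for line in lines:
        # Map step: the input record is (byte offset, line).
        for key, value in mapper(offset, line):
            grouped[key].append(value)  # stand-in for the shuffle
        offset += len(line.encode("utf-8")) + 1  # +1 for the newline
    output = []
    # Reduce step: one call per distinct key, with all of its values.
    for key in sorted(grouped):
        output.extend(reducer(key, grouped[key]))
    return output


def wordcount_map(offset, line):
    # Variable key (the word), fixed value (1).
    for word in line.split():
        yield word, 1


def line_length_map(offset, line):
    # Fixed custom key, variable value (the line length).
    yield "total_chars", len(line)


def sum_reduce(key, values):
    # Aggregate all values that share the same key.
    yield key, sum(values)


lines = ["to be or not", "to be"]
word_counts = run_job(lines, wordcount_map, sum_reduce)
total_chars = run_job(lines, line_length_map, sum_reduce)
print(word_counts)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
print(total_chars)  # [('total_chars', 17)]
```

The reducer output is again a list of (key,value) pairs, which is why one job's output can be fed to another job as input.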

This is a very rough, quick explanation without much detail. I encourage you to read the official documentation, or specialized literature such as this.

Upvotes: 2

suresiva

Reputation: 3173

I hope you have started learning MapReduce with the wordcount example.

A key/value pair is the record entity that MapReduce accepts for execution. The InputFormat classes, which read records from the source, and the OutputFormat classes, which commit results, operate only on records in key/value format.

The key/value format is the best-suited representation for records passing through the different stages of the map-partition-sort-combine-shuffle-merge-sort-reduce lifecycle of MapReduce. Please refer to:

http://www.thecloudavenue.com/2012/09/why-does-hadoop-uses-kv-keyvalue-pairs.html
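To make the partition and sort stages of that lifecycle a little more concrete, here is a rough Python sketch of why records need a key: the key decides which reducer a pair goes to and how pairs are ordered and grouped before the reduce call. It is not Hadoop code. The function names are hypothetical, and `zlib.crc32` is only a stand-in for a hash of the key, used so the result is repeatable between runs.

```python
import zlib
from itertools import groupby


def partition(key, num_reducers):
    # Same key -> same hash -> same reducer. crc32 is an assumed stand-in hash.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_output, num_reducers):
    """Route pairs to reducers by key, then sort and group them per reducer."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in map_output:
        partitions[partition(key, num_reducers)].append((key, value))

    reducer_inputs = []
    for pairs in partitions:
        pairs.sort(key=lambda kv: kv[0])  # sort by key within the partition
        grouped = [
            (key, [value for _, value in group])
            for key, group in groupby(pairs, key=lambda kv: kv[0])
        ]
        reducer_inputs.append(grouped)
    return reducer_inputs


map_output = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
reducer_inputs = shuffle(map_output, num_reducers=2)
for index, grouped in enumerate(reducer_inputs):
    print(index, grouped)
```

Whatever the hash values turn out to be, every occurrence of a given key lands in the same partition, so a single reducer sees all the values for that key.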

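The key/value data types can be anything (by default in Hadoop's Java API, keys implement WritableComparable and values implement Writable). The Text/IntWritable key/value pair you used is the best fit for wordcount, but it can actually be anything, according to your requirements.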
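As a small illustration of a value that is not a plain count, the following Python sketch (again not the Hadoop API) uses a `(sum, count)` tuple as the value in order to compute an average per key. The sensor records and all function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical input records: (sensor id, reading).
records = [("a", 2.0), ("b", 10.0), ("a", 4.0)]


def mapper(record):
    sensor, reading = record
    # The value is a (sum, count) tuple rather than the number 1.
    yield sensor, (reading, 1)


def reducer(key, values):
    total = sum(value[0] for value in values)
    count = sum(value[1] for value in values)
    yield key, total / count


# Group mapper output by key, standing in for the shuffle.
grouped = defaultdict(list)
for record in records:
    for key, value in mapper(record):
        grouped[key].append(value)

averages = dict(
    pair for key in sorted(grouped) for pair in reducer(key, grouped[key])
)
print(averages)  # {'a': 3.0, 'b': 10.0}
```

In a real Hadoop job a composite value like this would typically be a custom Writable class or an encoded Text value.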

Kindly spend some time reading Hadoop: The Definitive Guide or the Yahoo tutorials to get a better understanding. Happy learning!

Upvotes: 0
