Reputation: 33
I have been trying to learn Hadoop. In the examples I have seen (such as the word count example), the key parameter of the map function is not used at all; the map function only uses the value part of the pair. So it seems to me that the key parameter is unnecessary, but surely it is there for a reason. What am I missing here? Can you give me examples of map functions that use the key parameter?
Thanks
Upvotes: 2
Views: 1563
Reputation: 1892
In the WordCount example, we want to count the occurrences of each word in the file, so we use the following approach.
In the Mapper:
Key is the byte offset of the line within the text file.
Value is the line of text.
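For example, file.txt:
Hi I love Hadoop.
I code in Java.
Here:
Key - 0, value - Hi I love Hadoop.
Key - 18, value - I code in Java.
(Key 18 is the byte offset from the start of the file: the 17 characters of the first line plus one newline byte, assuming Unix line endings.)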
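To make the offset keys concrete, here is a minimal plain-Python sketch of how (offset, line) records could be derived from a file's bytes. It is not Hadoop code; the function name `offset_records` is hypothetical, and it only mimics the key/value shape that TextInputFormat hands to the mapper.

```python
from typing import Iterator, Tuple


def offset_records(data: bytes) -> Iterator[Tuple[int, str]]:
    """Yield (byte offset of the line start, line without its terminator)."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The key is where the line starts; the value is the line itself.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        # Advance past the line *and* its terminator byte(s).
        offset += len(raw)


if __name__ == "__main__":
    with open("file.txt", "rb") as handle:  # assumed to exist locally
        for key, value in offset_records(handle.read()):
            print(key, value)
```

Because the line terminator counts toward the offset, a file saved with Windows line endings would produce different keys than the same text saved with Unix line endings.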
Basically, the offset is just the default key, and we do not need it, especially in WordCount.
As for the rest of the logic, I guess you will find it here and at many other available links.
Just in case:
In the Reducer:
Key is the word.
Value is the list of 1s emitted for that word; summing them gives its count.
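The mapper and reducer roles described above can be sketched in plain Python. This is only an illustration of the data flow, not the Hadoop API: `wc_map`, `wc_reduce`, and `run_word_count` are hypothetical names, and the dictionary grouping stands in for Hadoop's shuffle and sort phase.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def wc_map(key: int, value: str) -> Iterator[Tuple[str, int]]:
    """key is the line's offset (ignored here); value is the line of text."""
    for word in value.split():
        yield word, 1


def wc_reduce(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """key is a word; values are all the 1s emitted for that word."""
    return key, sum(values)


def run_word_count(records: Iterable[Tuple[int, str]]) -> Dict[str, int]:
    """Run map, group by key, then reduce, all in one process."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for offset, line in records:
        for word, one in wc_map(offset, line):
            grouped[word].append(one)  # stand-in for shuffle/sort
    return dict(wc_reduce(word, ones) for word, ones in grouped.items())
```

Note that `wc_map` never reads `key`: the offset arrives with every record, but word counting has no use for it, which is exactly what the question noticed.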
Upvotes: 2
Reputation: 1811
To understand how the key is used, you need to know the various input formats available in Hadoop.
TextInputFormat - An InputFormat for plain text files. Files are broken into lines. Either a linefeed or a carriage return is used to signal the end of a line. Keys are the positions in the file, and values are the lines of text.
NLineInputFormat - An InputFormat that splits N lines of input as one split. In many "pleasantly" parallel applications, each process/mapper processes the same input file(s), but with computations controlled by different parameters (referred to as "parameter sweeps"). One way to achieve this is to specify a set of parameters (one set per line) as input in a control file, which is the input path to the map-reduce application, whereas the input dataset is specified via a config variable in JobConf. NLineInputFormat can be used in such applications: it splits the input file such that, by default, one line is fed as a value to one map task, and the key is the offset, i.e. (k,v) is (LongWritable, Text). The location hints will span the whole mapred cluster.
KeyValueTextInputFormat - An InputFormat for plain text files. Files are broken into lines. Either a linefeed or a carriage return is used to signal the end of a line. Each line is divided into key and value parts by a separator byte. If no such byte exists, the key will be the entire line and the value will be empty.
SequenceFileAsBinaryInputFormat - An InputFormat that reads keys and values from SequenceFiles in binary (raw) format.
SequenceFileAsTextInputFormat - This class is similar to SequenceFileInputFormat, except that it generates a SequenceFileAsTextRecordReader, which converts the input keys and values to their String forms by calling the toString() method.
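Since the question asks for map functions that actually use the key, here is a rough plain-Python sketch of two such cases. None of this is the Hadoop API: all function names are hypothetical, the "name,amount" CSV layout and the user-id records are made-up assumptions, and the tab separator mirrors the default of KeyValueTextInputFormat (which works on a separator byte rather than a Python string).

```python
from typing import Iterator, Tuple


def split_key_value(line: str, separator: str = "\t") -> Tuple[str, str]:
    """Split at the first separator; with no separator the whole line
    becomes the key and the value is empty."""
    key, _, value = line.partition(separator)
    return key, value


def skip_header_map(key: int, value: str) -> Iterator[Tuple[str, int]]:
    """Offset-keyed records (TextInputFormat style): the key identifies
    the first line of the file, so a CSV header can be dropped."""
    if key == 0:
        return  # offset 0 is the header row in this assumed layout
    name, amount = value.split(",")
    yield name, int(amount)


def user_event_map(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Key/value records (KeyValueTextInputFormat style): the input key,
    assumed to be a user id, is passed through as the output key."""
    yield key, len(value.split())
```

For instance, `user_event_map(*split_key_value("alice\tlogin page"))` emits one record keyed by "alice". With sequence files the same idea applies: the key is whatever the job that wrote the file chose to store, so the mapper often depends on it directly.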
Upvotes: 2