Reputation: 2180
I was going through the MapReduce Types and Formats from the book "HADOOP The Definitive Guide". The general form of MapReduce Types is given as:
map: (K1,V1) --> list(K2,V2)
reduce: (K2, list(V2)) --> list (K3, V3)
also
map: (K1,V1) --> list(K2,V2)
combine: (K2, list(V2) --> list(K2,V2)
reduce: (K2, list(V2)) --> list (K3, V3)
How can I resolve the word count problem in this general format. Suppose I've a text file:
AAA BBB CCC
DDD EEE AAA
GGG CCC BBB
Now formatting the text file in (K1, V1)
format (K1, V1)
(0,AAA BBB CCC )
(10,DDD EEE AAA )
(21,GGG CCC BBB )
How can I proceed ahead? Please correct me if there is anything mistake. Thanks!
Upvotes: 0
Views: 138
Reputation: 4111
Let's walk through the series of transformations to your data. We start with the raw data:
AAA BBB CCC
DDD EEE AAA
GGG CCC BBB
Suppose we're using TextInputFormat as the InputFormat. Then, the input to the mapper will be key, value pairs that look something like this:
0 AAA BBB CCC
13 DDD EEE AAA
26 GGG CCC BBB
Here, the file is broken up into lines. The key is the position in the file, and the value is the line itself, as a string.
The mapper needs to take these lines (ignoring their positions) and turn them into a list of key, value pairs. As an example, let us transform the first line, AAA BBB CCC
, into the list of key, value pairs:
AAA 1
BBB 1
CCC 1
We do this by splitting the string using the space character as the delimiter. Here, each key is a word and the value is the count of that word. Note that this is just one way of doing the mapping, and there are other equally valid ways of doing it.
After mapping all three lines, we get the following list of key, value pairs:
AAA 1
BBB 1
CCC 1
DDD 1
EEE 1
AAA 1
GGG 1
CCC 1
BBB 1
The above key, value pairs are sorted by key, and we get:
AAA 1
AAA 1
BBB 1
BBB 1
CCC 1
CCC 1
DDD 1
EEE 1
GGG 1
And then all values having the same key are grouped into a list (which can be done efficiently since they're sorted by key):
AAA [1, 1]
BBB [1, 1]
CCC [1, 1]
DDD 1
EEE 1
GGG 1
The combiner and the reducer in this case do the same thing, so let's assume that there is no combiner at first.
Let's also assume that there is only one reducer node. (In a multiple reducer setting, we are guaranteed that all values of the same key will go to the same reducer.)
Our reducer sums up the elements in the list for each key, yielding:
AAA 2
BBB 2
CCC 2
DDD 1
EEE 1
GGG 1
which is the desired word count.
Upvotes: 1