Reputation: 2419
I'm new to MapReduce and have a task that involves processing a large dataset (lines of records). I need the line number of each record in my mapper, so that the reducer can then process the line-number information emitted by the mapper.
For instance, suppose I have a very large input.txt in which each line looks like this:
1. Melo, apple, orange
2. orange, perl
3. apple, banana, car
...
10000. Apple
...
What if I want to record the occurrences of Apple by line number, and then calculate the relationships between the different fruits, like:
Apple => orange
Can I make the value in the key/value pair a list of line numbers? Since I have no idea how the data is partitioned across the datanodes, I assume the line-number information of the original input file will be lost. How is the data distributed among the datanodes: is it based on the offset from the first record, or on the size of the partitioned data?
I have looked up several tutorials and I am still confused about the exact workflow of MapReduce. In addition, I'm planning to use Amazon Elastic MapReduce with Python.
I may be asking about the same thing as in this discussion, but as far as I can tell no solution was found at the time. Is that right?
http://lucene.472066.n3.nabble.com/current-line-number-as-key-td2958080.html
Thanks!
Upvotes: 1
Views: 79
Reputation: 212
Here is the exact workflow of MapReduce:
The input file is split into multiple chunks, each of which is processed by a mapper. Each mapper emits (key, value) pairs.
Before these (key, value) pairs are distributed to the reducers, they are shuffled and sorted by key so that all values associated with a specific key are sent to the same reducer.
So each reducer receives input of the form (key, [value1, value2, value3, ..., valuen]).
Now let's get back to your example. At the map level you can emit (term, line number) as the (key, value) pair, so for apple we will have: (apple, 1), (apple, 3) ... (apple, 10000)
The reducer will receive (apple, [1, 3, ..., 10000]), and you can then process it as you like.
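The flow above can be sketched locally in Python. This is only a single-process simulation of the map, shuffle and reduce steps, not a Hadoop or Elastic MapReduce job. It assumes that every record carries its own line number as a leading "N." prefix, as in the sample input, because a mapper only sees its own chunk and cannot count lines from the start of the file. The names map_line, shuffle, reduce_term and run_job are hypothetical helpers, and treating "Apple" and "apple" as the same term is also an assumption.

```python
import re
from collections import defaultdict

# Assumed record layout: "<line number>. term, term, ..."
LINE_RE = re.compile(r"^\s*(\d+)\.\s*(.*)$")


def map_line(line):
    """Map step: yield (term, line_number) pairs for one record."""
    match = LINE_RE.match(line)
    if match is None:
        return  # skip records that lack the numeric prefix
    line_number = int(match.group(1))
    for term in match.group(2).split(","):
        term = term.strip().lower()  # assumption: case-insensitive terms
        if term:
            yield term, line_number


def shuffle(pairs):
    """Shuffle/sort step: group all values by key, keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: grouped[key] for key in sorted(grouped)}


def reduce_term(term, line_numbers):
    """Reduce step: return the sorted, de-duplicated line numbers of a term."""
    return term, sorted(set(line_numbers))


def run_job(lines):
    """Run map, shuffle and reduce in one process for illustration."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return dict(reduce_term(term, values) for term, values in shuffle(pairs).items())


sample = [
    "1. Melo, apple, orange",
    "2. orange, perl",
    "3. apple, banana, car",
    "10000. Apple",
]
index = run_job(sample)
print(index["apple"])  # [1, 3, 10000]
```

On a real cluster the shuffle is done by the framework, so only the map and reduce logic would be yours to write. If the records did not contain their own line numbers, the mapper would need some other position marker, since it cannot recover the global line number from its chunk alone.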
Upvotes: 0