Reputation: 3
I'm using MapReduce on Hadoop 2.6.0, and I want to skip the first six lines of my data file, so I use
if(key.get()<6)
return ;
else
{do ....}
in my map() function.
But this did not work. I found that the input key of map() is not the line number within the file. Instead, the key is the sum of the lengths of all preceding lines. Why? This does not match what many books describe.
Upvotes: 0
Views: 420
Reputation: 20969
If you look at the code, the key is the byte offset at which the line starts in the file, not the line number.
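To make the byte-offset behaviour concrete, here is a small plain-Java sketch (no Hadoop dependency) that mimics how keys relate to lines: each key is the byte position where the line starts, which equals the total byte length of all earlier lines including their newline characters. The class name LineOffsets and the method startOffsets are made up for illustration and are not part of Hadoop; the sketch also assumes UTF-8 text with '\n' line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineOffsets {
    // Returns the byte offset at which each line of the content starts.
    // Hypothetical helper: it only imitates the keys a line-based reader produces.
    static List<Long> startOffsets(String content) {
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        List<Long> offsets = new ArrayList<>();
        long lineStart = 0;
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == '\n') {
                offsets.add(lineStart);   // key of the line that just ended
                lineStart = i + 1;        // next line starts after the newline
            }
        }
        if (lineStart < bytes.length) {
            offsets.add(lineStart);       // last line without a trailing newline
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Three lines of 8, 8 and 6 bytes (newline included)
        System.out.println(startOffsets("id,name\n1,alice\n2,bob\n")); // prints [0, 8, 16]
    }
}
```

With keys like 0, 8, 16, a check such as key.get() < 6 is true only for the first line, which is why it does not skip six lines.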
If you want to skip the first n lines of your file, you probably have to write your own input format / record reader, or keep a line counter in the mapper logic, à la:
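int lines = 0;
public void map(LongWritable key, Text value, ...) {
  // skip the first six lines seen by this mapper (<= 6, since the counter starts at 1)
  if (++lines <= 6) { return; }
  // process the remaining lines here
}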
This obviously doesn't work if the text file is split across more than one mapper, because every mapper would then skip the first lines of its own split. So writing a dedicated InputFormat is the cleanest way to solve this problem.
Another trick would be to measure how many bytes the first n lines occupy in that specific file and then skip every record whose offset falls within that many bytes of the start.
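A rough sketch of that byte-counting trick in plain Java follows. The names HeaderBytes, bytesInFirstLines and isHeader are hypothetical, and the data is held in a byte array only to keep the example small; for a real file you would read the beginning of the file once, before submitting the job.

```java
import java.nio.charset.StandardCharsets;

public class HeaderBytes {
    // Number of bytes occupied by the first n lines, newline characters included.
    // If the data has fewer than n lines, the total length is returned.
    static long bytesInFirstLines(byte[] data, int n) {
        long count = 0;
        int linesSeen = 0;
        for (int i = 0; i < data.length && linesSeen < n; i++) {
            count++;
            if (data[i] == '\n') {
                linesSeen++;
            }
        }
        return count;
    }

    // A record belongs to the header if its starting byte offset lies inside it.
    static boolean isHeader(long keyOffset, long headerBytes) {
        return keyOffset < headerBytes;
    }

    public static void main(String[] args) {
        byte[] data = "h1\nh2\nrow1\nrow2\n".getBytes(StandardCharsets.UTF_8);
        long headerBytes = bytesInFirstLines(data, 2);
        System.out.println(headerBytes);               // prints 6
        System.out.println(isHeader(3, headerBytes));  // prints true
        System.out.println(isHeader(6, headerBytes));  // prints false
    }
}
```

In the mapper, the equivalent check would be along the lines of if (key.get() < headerBytes) { return; }, with headerBytes passed in through the job configuration. Because the key is an offset into the whole file rather than into the split, this check should also hold when the file is processed by several mappers, but the byte count is only valid for the one file it was measured on.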
Upvotes: 1