FlashXT

Reputation: 3

What's the Input key of MapReduce by default?

I'm using MapReduce on Hadoop 2.6.0, and I want to skip the first six lines of my data file, so I use

if(key.get()<6) 
   return ; 
else 
   {do ....} 

in my map() function.

But that did not work. I found that the input key of map() is not the line number within the file; it is the sum of the lengths of all the preceding lines. Why? That does not match what many books say.

Upvotes: 0

Views: 420

Answers (1)

Thomas Jungblut

Reputation: 20969

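If you look at the code, the key is the byte offset at which the line starts within the file, not the line number. That is why it looks like the sum of the lengths of the preceding lines.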
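To make the byte-offset keys concrete, here is a minimal plain-Java sketch with no Hadoop dependency. It only mimics the keys a line-based record reader would hand to map(); the class name LineOffsets and the sample data are made up for illustration, and it assumes UTF-8 text with '\n' line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: shows which byte offset each line starts at.
class LineOffsets {
    // Returns the start offset of every line, assuming UTF-8 and '\n' terminators.
    static List<Long> startOffsets(String fileContents) {
        List<Long> offsets = new ArrayList<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            offsets.add(offset);
            // Advance by the line's byte length plus one byte for the '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // The three lines start at bytes 0, 8 and 16, not at 0, 1 and 2.
        System.out.println(startOffsets("id,name\n1,alice\n2,bob\n"));
    }
}
```

With keys like these, a check such as key.get() < 6 only matches lines that start within the first six bytes of the file, which is usually just the first line.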

If you want to skip the first n lines of your file, you probably have to write your own input format / record reader, or keep a line counter in the mapper logic, like this:

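 int lines = 0; // lines seen so far by this mapper instance
 public void map(LongWritable key, Text value, ...) {
   // <= 6 skips exactly the first six lines (< 6 would skip only five)
   if (++lines <= 6) { return; }
   // process the remaining lines here
 }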

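This obviously doesn't work if the text file is split across more than one mapper, because each mapper starts its own counter at zero and would also drop the first lines of its own split. So writing a dedicated InputFormat is the cleanest way to solve this problem.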

Another trick would be to measure how many bytes the first n lines occupy in that specific file and then skip every record whose key (the byte offset) is smaller than that number.
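A rough sketch of that byte-counting trick, again in plain Java without Hadoop. HeaderSkip, headerBytes and isHeader are hypothetical names chosen for illustration, and the code assumes '\n' line endings and that the header size is measured once for the specific file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop.
class HeaderSkip {
    // Number of bytes occupied by the first n lines, including their '\n' terminators.
    static long headerBytes(String fileContents, int n) {
        if (n <= 0) {
            return 0;
        }
        byte[] bytes = fileContents.getBytes(StandardCharsets.UTF_8);
        int seen = 0;
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == '\n' && ++seen == n) {
                return i + 1; // first byte after the n-th line
            }
        }
        return bytes.length; // fewer than n lines: the whole file is header
    }

    // True if a record starting at this byte offset lies inside the header.
    static boolean isHeader(long keyOffset, long headerBytes) {
        return keyOffset < headerBytes;
    }
}
```

In the mapper, the corresponding check would be roughly if (key.get() < headerBytes) { return; }, with headerBytes passed in through the job configuration or hard-coded for that file. As long as the key is the offset within the whole file rather than within the split, this check does not depend on how many mappers are running.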

Upvotes: 1
