Vyassa Baratham

Reputation: 1467

Getting byte offset with MRJob

According to "Hadoop: The Definitive Guide", the input format TextInputFormat yields key-value pairs (k, v) = (byte offset, line). In MRJob, however, the key passed to the mapper is always None. Getting the byte offset as the key should be easy, since that is what TextInputFormat already provides. How do I get it?

I know I can read the environment variable 'map_input_start' and calculate the byte offsets myself, but that has caused problems, and I would much rather just receive the offset as the key.
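For reference, the manual workaround mentioned above might look roughly like the sketch below. It is only an illustration under stated assumptions: the helper names `split_start` and `with_offsets` are hypothetical, the environment variable is assumed to be exported by Hadoop Streaming as `map_input_start` (or `mapreduce_map_input_start` on newer versions), and the mapper is assumed to see each line as raw bytes with its terminator still attached. Also note that for a split that does not begin at offset 0, the record reader typically discards a partial first line, so the start value may not be the exact offset of the first line the mapper sees. That may be one source of the problems.

```python
import os


def split_start(environ=os.environ):
    """Return the split's starting byte offset, or 0 if it is not set."""
    # Assumption: Hadoop Streaming exports jobconf keys as environment
    # variables with dots replaced by underscores, and one of these
    # two names is present.
    for name in ("mapreduce_map_input_start", "map_input_start"):
        if name in environ:
            return int(environ[name])
    return 0


def with_offsets(raw_lines, start=0):
    """Yield (byte_offset, line) pairs by counting bytes as lines go by.

    raw_lines must be bytes objects that still carry their line
    terminator; otherwise the running total drifts.
    """
    offset = start
    for raw in raw_lines:
        # Emit the offset of this line, then advance past it.
        yield offset, raw.rstrip(b"\r\n")
        offset += len(raw)


if __name__ == "__main__":
    sample = [b"ab\n", b"cde\n"]
    for offset, line in with_offsets(sample, start=split_start()):
        print(offset, line)
```

The running total is only as accurate as the bytes the mapper actually receives. If line terminators are rewritten on the way in (for example CRLF input delivered with a single newline), the computed offsets will drift from the true file offsets.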

Upvotes: 0

Views: 1017

Answers (2)

Magham Ravi

Reputation: 603

Doesn't defining the map method in your mapper class with the following signature give you the byte offset as the key?

public void map(LongWritable key, Text value, OutputCollector<K2, V2> output, Reporter reporter) throws IOException

Upvotes: 0

Niels Basjes

Reputation: 10642

The TextInputFormat is a Java class ... I do not see how that would work in the streaming world.

Upvotes: 0
