Reputation: 344
I have copied a single file with 10 million rows into HDFS. I need to process lines 5000 to 500000 in the mapper. How should I do this?
I tried overriding the run() method in the mapper and keeping a counter there. But when the file is split and multiple mappers are running, there are of course multiple independent counters, so it doesn't help. The code is pasted below.
@Override
public void run(Mapper<LongWritable, Text, Text, Text>.Context context)
        throws IOException, InterruptedException {
    setup(context);
    // Counts only the records seen by this mapper's split, not global line numbers.
    int counter = 0;
    try {
        while (context.nextKeyValue()) {
            LongWritable currentKey = context.getCurrentKey();
            Text currentValue = context.getCurrentValue();
            System.out.println(currentKey.toString());
            map(currentKey, currentValue, context);
            counter++;
        }
    } finally {
        cleanup(context); // the default run() also calls cleanup()
    }
    System.out.println("Counter: " + counter + " Time: "
            + System.currentTimeMillis());
}
Also, the KEY I get in the mapper is not the line number but the byte offset of the line. Can I get a KEY that points to the line number? If so, will it be unique across multiple mappers? (The current KEY, the offset, is not unique across mappers.)
How can I get it right?
Upvotes: 2
Views: 3135
Reputation: 3798
I would try adding the line numbers in a first MapReduce job. Then you can run your actual MapReduce job, with some code in the Mapper that inspects the line number and either discards the whole line or performs your analysis.
EDIT: I'm now thinking the first MR job cannot be implemented, since its mappers face the same problem as the original one: they receive splits with no reference at all to their position within the whole big file.
Upvotes: 0
Reputation: 1811
The default InputFormats, such as TextInputFormat, give you the byte offset of the record rather than the actual line number. This is mainly because the true line number cannot be determined when an input file is splittable and is being processed by two or more mappers.
You can create your own InputFormat that produces line numbers rather than byte offsets, but it needs to return false from the isSplitable method, which means a large input file will not be processed by multiple mappers. If you have small files, or files close in size to the HDFS block size, this is not a problem.
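As an illustration, here is a minimal sketch of such an input format (the class names LineNumberInputFormat and LineNumberRecordReader are hypothetical; it assumes the new org.apache.hadoop.mapreduce API and plain text input). It disables splitting and wraps the standard LineRecordReader, replacing the byte-offset key with a running line count:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical input format: emits (lineNumber, lineText) instead of (byteOffset, lineText).
public class LineNumberInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Never split the file, so a single mapper sees all lines in order
        // and the running line count below is meaningful.
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new LineNumberRecordReader();
    }

    public static class LineNumberRecordReader
            extends RecordReader<LongWritable, Text> {

        private final LineRecordReader reader = new LineRecordReader();
        private final LongWritable lineNumber = new LongWritable(0);

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            reader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (!reader.nextKeyValue()) {
                return false;
            }
            // Replace the byte-offset key with a 1-based line number.
            lineNumber.set(lineNumber.get() + 1);
            return true;
        }

        @Override
        public LongWritable getCurrentKey() {
            return lineNumber;
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return reader.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return reader.getProgress();
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }
    }
}

The driver would register it with job.setInputFormatClass(LineNumberInputFormat.class), and the mapper can then simply ignore any key outside the 5000 to 500000 range. Note that with splitting disabled the whole file is read by a single mapper, so you lose parallelism on large inputs.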
You can also use Pig to clean your data, pull out just the lines you are interested in, and then process that subset.
I feel this is a drawback of Hadoop: it falls short when you want to share global state across different systems.
Upvotes: 3