Nati Krisi

Reputation: 1041

processing subset of a file in mapreduce

I need to process a huge file using MapReduce, and I'm required to provide a way for end users to select how many records they want to process.

The problem is that there doesn't seem to be any effective way to process only a subset of the file without "mapping" the whole file (a 25 TB file).

Is there a way to stop mapping after a specific number of records and continue with the reduce part?

Upvotes: 1

Views: 590

Answers (2)

Amar

Reputation: 12010

There is a simple and elegant solution to this problem: override the run() method of the org.apache.hadoop.mapreduce.Mapper class and execute map() only for as many records as you want, or only for the records you need.

See the following:

public static class MapJob extends Mapper<LongWritable, Text, Text, Text> {

    private Text outputKey = new Text();
    private Text outputValue = new Text();
    private int numberOfRecordsToProcess;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Read the limit from the configuration value set in the driver class
        // after getting input from the user. The key name
        // "mapper.records.to.process" is an example; use whatever key your driver sets.
        numberOfRecordsToProcess = context.getConfiguration()
                .getInt("mapper.records.to.process", Integer.MAX_VALUE);
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Do your map thing
    }

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        int count = 0;
        while (context.nextKeyValue()) {
            if (count++ < numberOfRecordsToProcess) { // stop once enough records have been processed
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            } else {
                break;
            }
        }
        cleanup(context); // cleanup must run inside run(), after the loop ends
    }
}
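For reference, here is a minimal driver sketch showing how the user's record limit could be passed to the mapper through the job configuration. The key name "mapper.records.to.process", the class names, and the argument layout are illustrative assumptions, not part of the original answer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubsetDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // args[2] is the user-supplied record limit (assumed CLI layout)
        conf.setInt("mapper.records.to.process", Integer.parseInt(args[2]));

        Job job = Job.getInstance(conf, "process file subset");
        job.setJarByClass(SubsetDriver.class);
        job.setMapperClass(MapJob.class); // the mapper defined above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}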

Upvotes: 2

Ronak Patel

Reputation: 3849

See How to create output files with fixed number of lines in hadoop/map reduce? You may use the information from that link to feed N lines to each mapper as input, and to run only one mapper from the main class with:

setNumMapTasks(int) 
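A minimal sketch of that idea, assuming the NLineInputFormat approach described in the linked question (the line count of 10000, the class name, and the argument layout are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NLineDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "n-line subset");
        job.setJarByClass(NLineDriver.class);
        job.setMapperClass(Mapper.class); // identity mapper as a placeholder
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // NLineInputFormat gives each mapper a split of at most N input lines,
        // so the number of records a single mapper sees is bounded by N.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 10000);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that setNumMapTasks(int) belongs to the old org.apache.hadoop.mapred.JobConf API and is only a hint to the framework; with the new API, the input splits determine how many mappers run.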

Upvotes: 0
