Reputation: 1041
I need to process a huge file using MapReduce, and I was asked for a way to let end users select how many records they want to process.
The problem is that there isn't any effective way to process only a subset of the file without "mapping" the whole file (a 25 TB file).
Is there a way to stop mapping after a specific number of records and continue with the reduce part?
Upvotes: 1
Views: 590
Reputation: 12010
There is a very simple and elegant solution to this problem: override the run() method of org.apache.hadoop.mapreduce.Mapper and only execute map() for as long as you want, or only for those records which you need. See the following:
public static class MapJob extends Mapper<LongWritable, Text, Text, Text> {

    private Text outputKey = new Text();
    private Text outputValue = new Text();
    private int numberOfRecordsToProcess;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Read the limit from the configuration value set in the driver class
        // after getting the input from the user ("number.of.records.to.process"
        // is an example key name; use whatever key your driver sets).
        numberOfRecordsToProcess = context.getConfiguration()
                .getInt("number.of.records.to.process", Integer.MAX_VALUE);
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Do your map thing
    }

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        int count = 0;
        while (context.nextKeyValue()) {
            if (count++ < numberOfRecordsToProcess) { // stop once enough records have been processed
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            } else {
                break;
            }
        }
        cleanup(context);
    }
}
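For completeness, here is a minimal driver sketch showing how the user-supplied limit could be passed to the mappers. The class name, argument layout, and the "number.of.records.to.process" key are illustrative assumptions, not part of the original answer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LimitedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Pass the user-supplied limit to every mapper via the job configuration.
        conf.setInt("number.of.records.to.process", Integer.parseInt(args[2]));

        Job job = Job.getInstance(conf, "limited record processing");
        job.setJarByClass(LimitedJobDriver.class);
        job.setMapperClass(MapJob.class); // the MapJob mapper defined above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that because each map task counts its own records, the limit applies per input split: a job with several splits may process up to that many records in each of them.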
Upvotes: 2
Reputation: 3849
See How to create output files with fixed number of lines in hadoop/map reduce? You may use the information from that link to feed N lines at a time to each mapper as input, and to run only one mapper from the main class with setNumMapTasks(int).
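A minimal sketch of that approach using the new-API NLineInputFormat, which splits the input so that each map task receives a fixed number of lines. The class name and argument layout are assumptions, and setNumMapTasks(int) from the old JobConf API is only a hint to the framework, so the line-based split is what actually bounds each mapper's input:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NLineDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "n-line input");
        job.setJarByClass(NLineDriver.class);

        // Each input split (and therefore each map task) gets exactly N lines;
        // with the default identity Mapper, lines pass through unchanged.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, Integer.parseInt(args[2]));

        NLineInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}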
Upvotes: 0