Reputation: 1041
I need to process a huge file using MapReduce, and I was asked for a way to let end users select how many records they want to process.
The problem is that there isn't any effective way to process only a subset of the file without "mapping" the whole file (a 25 TB file).
Is there a way to stop mapping after a specific number of records and continue with the reduce part?
Upvotes: 1
Views: 590
Reputation: 12010
There is a very simple and elegant solution to this problem: override the run() method of org.apache.hadoop.mapreduce.Mapper and only execute map() for as long as you want, or only for those records which you need. See the following:
public static class MapJob extends Mapper<LongWritable, Text, Text, Text> {

    private Text outputKey = new Text();
    private Text outputValue = new Text();
    private int numberOfRecordsToProcess;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Read the limit from the configuration value set in the driver class
        // after getting the input from the user ("number.of.records.to.process"
        // is an example key name; use whatever key your driver sets).
        numberOfRecordsToProcess = context.getConfiguration()
                .getInt("number.of.records.to.process", Integer.MAX_VALUE);
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Do your map thing
    }

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        int count = 0;
        while (context.nextKeyValue()) {
            if (count++ < numberOfRecordsToProcess) { // stop once enough records have been processed
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            } else {
                break;
            }
        }
        cleanup(context);
    }
}
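For completeness, here is a minimal driver sketch showing how the user-supplied limit could be passed to the mappers. The class name, argument layout, and the "number.of.records.to.process" key are illustrative assumptions, not part of the original answer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LimitedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Pass the user-supplied limit to every mapper via the job configuration.
        conf.setInt("number.of.records.to.process", Integer.parseInt(args[2]));

        Job job = Job.getInstance(conf, "limited record processing");
        job.setJarByClass(LimitedJobDriver.class);
        job.setMapperClass(MapJob.class); // the MapJob mapper defined above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that because each map task counts its own records, the limit applies per input split: a job with several splits may process up to that many records in each of them.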
Upvotes: 2
Reputation: 3849
See How to create output files with fixed number of lines in hadoop/map reduce? You may use the information from that link to feed N lines at a time to each mapper as input, and to run only one mapper from the main class with setNumMapTasks(int).
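A minimal sketch of that approach using the new-API NLineInputFormat, which splits the input so that each map task receives a fixed number of lines. The class name and argument layout are assumptions, and setNumMapTasks(int) from the old JobConf API is only a hint to the framework, so the line-based split is what actually bounds each mapper's input:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NLineDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "n-line input");
        job.setJarByClass(NLineDriver.class);

        // Each input split (and therefore each map task) gets exactly N lines;
        // with the default identity Mapper, lines pass through unchanged.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, Integer.parseInt(args[2]));

        NLineInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}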
Upvotes: 0