Reputation: 396
I have a requirement to process a file as-is, meaning the file content must be processed in the order in which it appears in the file.
For example: I have a 700 MB file. How can we make sure the file is processed in the order it appears, given that this depends on DataNode availability? In some cases one of the DataNodes may process its part of the file slowly (low-spec hardware).
One way to fix this would be to add a unique id/key to the file, but we don't want to add anything new to the file.
Any thoughts :)
Upvotes: 0
Views: 52
Reputation: 4971
You can guarantee that a single mapper processes the entire content of the file by writing your own FileInputFormat that overrides isSplitable to return false. E.g.
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    // Returning false tells the framework never to split this file,
    // so exactly one mapper receives the whole file.
    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }
}
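The code above references a WholeFileRecordReader that isn't shown. A minimal sketch for the old mapred API might look like the following; it assumes the whole file fits in memory and is emitted as a single key/value record (the class name and constructor signature are taken from the snippet above, the rest is an illustrative implementation):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.RecordReader;

public class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {

    private final FileSplit fileSplit;
    private final Configuration conf;
    private boolean processed = false;

    public WholeFileRecordReader(FileSplit fileSplit, Configuration conf) {
        this.fileSplit = fileSplit;
        this.conf = conf;
    }

    @Override
    public boolean next(Text key, BytesWritable value) throws IOException {
        if (processed) {
            return false; // the single record has already been emitted
        }
        // Read the entire (unsplit) file into one byte array, preserving order.
        byte[] contents = new byte[(int) fileSplit.getLength()];
        Path file = fileSplit.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            key.set(file.toString());
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    @Override
    public Text createKey() {
        return new Text();
    }

    @Override
    public BytesWritable createValue() {
        return new BytesWritable();
    }

    @Override
    public long getPos() throws IOException {
        return processed ? fileSplit.getLength() : 0;
    }

    @Override
    public float getProgress() throws IOException {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() throws IOException {
        // the stream is closed in next()
    }
}
```

Note that reading a 700 MB file as one record requires the mapper JVM to have enough heap for the byte array, so you may need to raise the task memory settings accordingly.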
For more examples of how to do this, I recommend looking at a GitHub project. Depending on your Hadoop version, slight changes might be necessary.
Upvotes: 2