Reputation: 396
I have a requirement to process a file as-is, meaning the file content must be processed in the order in which it appears in the file.
For example: I have a 700 MB file. How can we make sure the file is processed in the order it appears, given that this depends on DataNode availability? In some cases one of the DataNodes may process its part of the file slowly (low-spec hardware).
One way to fix this would be to add a unique id/key to the file, but we don't want to add anything new to the file.
Any thoughts :)
Upvotes: 0
Views: 52
Reputation: 4971
You can guarantee that a single mapper processes the entire content of the file by writing your own FileInputFormat that overrides isSplitable to return false. E.g.
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    // Returning false tells the framework never to split this file,
    // so exactly one mapper receives the whole file.
    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }
}
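The code above references a WholeFileRecordReader that isn't shown. A minimal sketch for the old mapred API might look like the following; it assumes the whole file fits in memory and is emitted as a single key/value record (the class name and constructor signature are taken from the snippet above, the rest is an illustrative implementation):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.RecordReader;

public class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {

    private final FileSplit fileSplit;
    private final Configuration conf;
    private boolean processed = false;

    public WholeFileRecordReader(FileSplit fileSplit, Configuration conf) {
        this.fileSplit = fileSplit;
        this.conf = conf;
    }

    @Override
    public boolean next(Text key, BytesWritable value) throws IOException {
        if (processed) {
            return false; // the single record has already been emitted
        }
        // Read the entire (unsplit) file into one byte array, preserving order.
        byte[] contents = new byte[(int) fileSplit.getLength()];
        Path file = fileSplit.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            key.set(file.toString());
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    @Override
    public Text createKey() {
        return new Text();
    }

    @Override
    public BytesWritable createValue() {
        return new BytesWritable();
    }

    @Override
    public long getPos() throws IOException {
        return processed ? fileSplit.getLength() : 0;
    }

    @Override
    public float getProgress() throws IOException {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() throws IOException {
        // the stream is closed in next()
    }
}
```

Note that reading a 700 MB file as one record requires the mapper JVM to have enough heap for the byte array, so you may need to raise the task memory settings accordingly.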
For more examples of how to do this, I recommend looking at a GitHub project. Depending on your Hadoop version, slight changes might be necessary.
Upvotes: 2