Reputation: 105
I have hundreds to thousands of input files (.csv) and metadata files (.json) in the same folder, say $HDFS_ROOT/inputFolder:
// Input data .csv files
input_1.csv, input_2.csv..input_N.csv
// Input metadata .json files
input_1.json, input_2.json..input_N.json
Can someone give me tips on how to make each mapper get a file pair, i.e. a whole input file (.csv) and its metadata file (.json)?
NOTE: input_i.csv and input_i.json should go to the same mapper so that the input and its metadata can be validated together.
What I tried: I tried using WholeFileInputFormat and WholeFileRecordReader, extending FileInputFormat and RecordReader respectively. This suffices for the .csv files only. I also placed the .json files into the distributed cache to make them accessible to the mapper, but it's not a good solution.
Upvotes: 0
Views: 333
Reputation: 3619
The key to solving this problem without a costly Reducer is the InputSplit. Each InputFormat has a getSplits method; a single split is the input for a single Mapper, so there are as many Mappers as there are InputSplits. In the mapper you can access the InputSplit instance:
// Inside your Mapper subclass: inspect the split this task attempt was given
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    System.out.println("TRACE 1 " + context.getConfiguration().getClass().getName());
    System.out.println("TRACE 2 " + context.getTaskAttemptID().toString());
    System.out.println("TRACE 3 " + context.getInputSplit().toString());
}
Based on this, there are three approaches I have used in the past:
1) context.getInputSplit() returns an instance of FileSplit, which has a Path getPath() method. But you have to watch out for CombineFileSplit and TaggedInputSplit, which can wrap the FileSplit. With CombineFileSplit, if you do not override the default behaviour around CombineFileInputFormat.pools, you risk mixing records with different structures in the same Mapper without being able to distinguish them (see the first sketch after this list);
2) A simpler approach is to use context.getInputSplit().toString(): the returned string contains the path that the InputSplit is attached to. This works well with MultipleInputs, but not with CombineFileInputFormat. It is a little dirty, as you are at the mercy of the toString() methods, so I would not recommend it for a production system, but it is good enough for quick prototypes (see the second sketch below);
3) Define your own proxy InputFormat and InputSplit implementations, similar to what the MultipleInputs approach does: it relies on DelegatingInputFormat, which wraps the InputSplit of the InputFormat that can read the data, but puts it inside a TaggedInputSplit (see the source code). In your case you can hide the metadata logic in your own InputFormat and InputSplit and free the Mappers from knowing how to match a file to its metadata. You can also associate input paths with metadata directly, without relying on naming conventions. This approach is well suited for production systems (see the third sketch below).
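For approach 1, here is a minimal sketch of unwrapping the split to reach the file Path. TaggedInputSplit is package-private, so the usual workaround is reflection; the helper name pathOf is my own:

import java.lang.reflect.Method;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical helper: recover the Path backing an InputSplit
static Path pathOf(InputSplit split) throws Exception {
    if (split instanceof FileSplit) {
        return ((FileSplit) split).getPath();
    }
    // MultipleInputs hides the real split inside the package-private
    // TaggedInputSplit, so it has to be unwrapped via reflection
    if (split.getClass().getName().endsWith("TaggedInputSplit")) {
        Method m = split.getClass().getDeclaredMethod("getInputSplit");
        m.setAccessible(true);
        return pathOf((InputSplit) m.invoke(split));
    }
    throw new IllegalArgumentException("Unexpected split type: " + split.getClass());
}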
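For approach 2, a quick-and-dirty sketch, assuming a plain FileSplit whose toString() has the usual path:start+length shape and that the metadata follows your input_i.csv / input_i.json naming convention:

// Inside setup(): derive both paths from the split's string form.
// FileSplit.toString() looks like "hdfs://.../inputFolder/input_7.csv:0+1048576"
String splitInfo = context.getInputSplit().toString();
String csvPath = splitInfo.substring(0, splitInfo.lastIndexOf(':'));
String jsonPath = csvPath.replaceAll("\\.csv$", ".json");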
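For approach 3, a condensed sketch of a custom split that carries both paths, so the Mapper never sees the pairing convention. PairedFileSplit is a hypothetical name; your InputFormat's getSplits would build one such split per .csv/.json pair, and its RecordReader would open both files:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

// Hypothetical split pairing an input file with its metadata file
public class PairedFileSplit extends InputSplit implements Writable {
    private Path csv;
    private Path json;
    private long length;

    public PairedFileSplit() { }  // no-arg constructor needed for deserialization

    public PairedFileSplit(Path csv, Path json, long length) {
        this.csv = csv;
        this.json = json;
        this.length = length;
    }

    public Path getCsvPath() { return csv; }
    public Path getJsonPath() { return json; }

    @Override
    public long getLength() { return length; }

    @Override
    public String[] getLocations() { return new String[0]; }

    @Override
    public void write(DataOutput out) throws IOException {
        Text.writeString(out, csv.toString());
        Text.writeString(out, json.toString());
        out.writeLong(length);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        csv = new Path(Text.readString(in));
        json = new Path(Text.readString(in));
        length = in.readLong();
    }
}

Since the pairing happens in getSplits, you can also drive it from a manifest instead of file names.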
Upvotes: 1