konstantin

Reputation: 725

Hadoop to reduce from multiple input formats

I have two files with different data formats in HDFS. How would a job set up look like, if I needed to reduce across both data files?

e.g. imagine the common word count problem, where in one file you have a space as the word delimiter and in another file an underscore. In my approach I need different mappers for the various file formats, which then feed into a common reducer.

How to do that? Or is there a better solution than mine?

Upvotes: 3

Views: 4848

Answers (1)

Donald Miner

Reputation: 39913

Check out the MultipleInputs class, which solves this exact problem. It's pretty neat: for each input path you pass in the InputFormat and, optionally, the Mapper class.
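A minimal driver sketch using MultipleInputs from the newer `org.apache.hadoop.mapreduce` API; the mapper/reducer class names and input paths are illustrative, not part of the original answer:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Illustrative job setup: each input directory gets its own mapper,
// both mappers emit the same (word, count) key/value types, and a
// single reducer sums them.
Job job = Job.getInstance(conf, "multi-format word count");
MultipleInputs.addInputPath(job, new Path("/data/space-delimited"),
        TextInputFormat.class, SpaceTokenMapper.class);
MultipleInputs.addInputPath(job, new Path("/data/underscore-delimited"),
        TextInputFormat.class, UnderscoreTokenMapper.class);
job.setReducerClass(SumReducer.class);
```

Because both mappers must emit identical key/value types, the reducer never needs to know which file a record came from.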

If you are looking for code examples on Google, search for "reduce-side join", which is where this method is typically used.


On the other hand, sometimes I find it easier to just use a hack. For example, if you have one set of files that is space delimited and another that is underscore delimited, load both with the same mapper and TextInputFormat and tokenize on both possible delimiters. Then compare the token counts of the two resulting splits. In the word count example, pick the split with more tokens.
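The delimiter-sniffing hack can be sketched as a plain helper (class and method names here are illustrative, and it assumes exactly the two delimiters from the example):

```java
public class DelimiterSniffer {
    // Split the line on each candidate delimiter and keep whichever
    // split yields more tokens: "a_b_c" gives 3 tokens on "_" but only
    // 1 on " ", so the underscore split wins for that line.
    static String[] tokenize(String line) {
        String[] bySpace = line.split(" ");
        String[] byUnderscore = line.split("_");
        return bySpace.length >= byUnderscore.length ? bySpace : byUnderscore;
    }
}
```

In the mapper you would call this once per input line and emit each returned token as a word.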

This also works if both files use the same delimiter but have a different number of standard columns. You can tokenize on comma and then see how many tokens there are. If there are, say, 5 tokens, the record is from data set A; if there are 7, it is from data set B.
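The column-count variant looks like this as a sketch; the 5- and 7-column counts come straight from the example above, while the class and label names are made up for illustration:

```java
public class SchemaSniffer {
    // Both data sets are comma-delimited; data set A has 5 columns and
    // data set B has 7, so the field count identifies the record layout.
    static String whichDataSet(String line) {
        // limit of -1 keeps trailing empty fields so the count is exact
        int fields = line.split(",", -1).length;
        if (fields == 5) return "A";
        if (fields == 7) return "B";
        return "unknown";
    }
}
```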

Upvotes: 4

Related Questions