Reputation: 743
Suppose I have a large input in the following format:
1,2,6,4
4,5,18,7
9,1,3,5
......
The output should be its transpose:
1 4 9 ..
2 5 1 ..
6 18 3 ..
4 7 5 ..
In this case the row number is not specified; the column number can be derived while parsing. Assume the file is very large and will be split across multiple mappers. Since the row number is not specified, it won't be possible to identify the order of the output from each mapper. Is it possible to pre-process the input file with another MapReduce program that assigns a row number to each line before the file is sent to the mapper?
Upvotes: 0
Views: 1580
Reputation: 2538
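When you use TextInputFormat, the key passed to the mapper is a LongWritable holding the byte offset of the line in the input file. Although it is not actually the row number, offsets within a single file increase in row order, so you can use it to sort the values within each column when writing the output. The whole MapReduce job would look something like this: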
import java.io.IOException;
import java.util.TreeMap;

import org.apache.commons.lang.StringUtils; // commons-lang 2.x; adjust the package if you use lang3
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Transpose {

    public static class TransposeMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            long column = 0;
            long somethingLikeRow = key.get(); // byte offset of this line in the input file
            for (String num : value.toString().split(",")) {
                // key: column index, value: "<offset>\t<cell>"
                context.write(new LongWritable(column), new Text(somethingLikeRow + "\t" + num));
                ++column;
            }
        }
    }

    public static class TransposeReducer extends Reducer<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void reduce(LongWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            TreeMap<Long, String> row = new TreeMap<Long, String>(); // storing values sorted by positions in input file
            for (Text text : values) {
                String[] parts = text.toString().split("\t", 2); // somethingLikeRow, value
                row.put(Long.valueOf(parts[0]), parts[1]);
            }
            String rowString = StringUtils.join(row.values(), ' '); // I'm using the org.apache.commons library for concatenation
            context.write(new Text(rowString), NullWritable.get());
        }
    }
}
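To see why the byte offset is enough to restore row order, here is a small local sketch in plain Java that imitates the same idea without Hadoop. The class name TransposeSketch and the method transpose are made up for illustration, and it assumes a single input with "\n" line endings. Lines are deliberately processed in reverse to mimic mappers finishing out of order; the offsets still put every column back in the original row order.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local imitation of the job; no Hadoop classes are used.
class TransposeSketch {

    static List<String> transpose(String input) {
        // Step 1: pair each line with its byte offset, as TextInputFormat would.
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : input.split("\n")) {
            if (!line.isEmpty()) {
                records.add(Map.entry(offset, line));
            }
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }

        // Mimic mappers emitting in an arbitrary order.
        Collections.reverse(records);

        // Step 2 ("map" + "shuffle"): column index -> (offset -> cell).
        // The inner TreeMap sorts by offset, like the reducer's TreeMap.
        Map<Long, TreeMap<Long, String>> columns = new TreeMap<>();
        for (Map.Entry<Long, String> record : records) {
            long column = 0;
            for (String cell : record.getValue().split(",")) {
                columns.computeIfAbsent(column, c -> new TreeMap<>())
                       .put(record.getKey(), cell);
                column++;
            }
        }

        // Step 3 ("reduce"): one output row per input column.
        List<String> output = new ArrayList<>();
        for (TreeMap<Long, String> column : columns.values()) {
            output.add(String.join(" ", column.values()));
        }
        return output;
    }

    public static void main(String[] args) {
        for (String row : transpose("1,2,6,4\n4,5,18,7\n9,1,3,5\n")) {
            System.out.println(row);
        }
    }
}
```

Note that in the real job the rows of the transposed output are only ordered by column index within each reducer's output file, so with more than one reducer you would still need to concatenate the part files in a way that respects the key order (or run a single reducer).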
Upvotes: 1