What can cause Hadoop to skip the sorting step?

Question

I'm trying to use Hadoop to format and sort a very large dataset, but it seems to be skipping the sort step. The mapper transforms an Avro input file into a few interesting fields in JSON.

// Emits (month, JSON line) for every interesting record.
public void map(AvroWrapper<Datum> wrappedAvroDatum, NullWritable nothing,
                OutputCollector<Text, Text> collector, Reporter reporter)
        throws IOException {
    Datum datum = wrappedAvroDatum.datum();
    if (interesting(datum)) {
        Long time = changeTimeZone(datum.getTime());
        // String.format is static in Java: the format string is the first argument.
        String key = String.format("%02d", month(time));
        String value = String.format("{\"time\": %d, \"other-stuff\": %s, ...}", time, datum.getOtherStuff());
        collector.collect(new Text(key), new Text(value));
    }
}

The reducer assumes that the values for each key are in lexicographical order (appropriate for org.apache.hadoop.io.Text, right?) and just strips the keys so that I get a text file, one JSON object per line.

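// Drops the key and writes each JSON value on its own line.
public void reduce(Text key, java.util.Iterator<Text> values,
                   OutputCollector<NullWritable, Text> collector, Reporter reporter)
        throws IOException {
    while (values.hasNext()) {
        // NullWritable.get() is a method call; copy the Text because Hadoop reuses the instance.
        collector.collect(NullWritable.get(), new Text(values.next()));
    }
}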

I'm expecting text files that are sorted in blocks of one month (that is, I don't expect the months to be in order, but I expect times within each month to be in order). What I get are text files that are grouped by month but completely unsorted. Clearly, Hadoop is grouping the Text records by their key value, but it is not sorting them.

(Known issues: I'm relying on the fact that "time" comes first in my JSON object and has exactly the same number of digits for all records, so that lexicographical order is numerical order. This is true for my data.)
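
The assumption in the note above can be checked in isolation. The following is only a sketch: `FixedWidthOrderCheck` and `orderByJsonText` are hypothetical names, and plain `String` comparison stands in for `Text` comparison, which should agree for ASCII-only lines like these.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FixedWidthOrderCheck {

    // Length of the fixed prefix {"time":<space> that precedes the digits.
    private static final int PREFIX_LENGTH = "{\"time\": ".length();

    /**
     * Renders each time as a JSON line, sorts the lines as strings,
     * and returns the times in the resulting order.
     */
    static List<Long> orderByJsonText(List<Long> times) {
        List<String> lines = new ArrayList<>();
        for (long t : times) {
            lines.add("{\"time\": " + t + "}");
        }
        Collections.sort(lines); // lexicographic, character by character
        List<Long> ordered = new ArrayList<>();
        for (String line : lines) {
            ordered.add(Long.parseLong(line.substring(PREFIX_LENGTH, line.length() - 1)));
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<Long> sameWidth = new ArrayList<>();
        sameWidth.add(1370000300L);
        sameWidth.add(1370000100L);
        sameWidth.add(1370000200L);
        System.out.println(orderByJsonText(sameWidth));

        List<Long> mixedWidth = new ArrayList<>();
        mixedWidth.add(9L);
        mixedWidth.add(10L);
        System.out.println(orderByJsonText(mixedWidth));
    }
}
```

With equal-width times the string order matches the numeric order; with mixed widths (9 and 10) it does not, which is why the fixed digit count matters.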

When I used Hadoop Streaming (not an option in this project), text lines were sorted automatically. The sorting could be configured, but by default it did what I wanted. In raw Hadoop, does sorting need to be turned on somehow? If so, how? If it's supposed to be on by default, where can I start looking to debug this problem?

I'm observing this behavior in Cloudera's CDH4 Hadoop-0.20 package in pseudodistributed mode and on Amazon's Elastic Map-Reduce (EMR).

cabad · Accepted Answer

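Hadoop sorts by key, not by value, so the results you are getting are the expected behavior. Hadoop has not skipped the sort phase: it sorts and groups the records by key (the month, in your case), but the values that share a key reach the reducer in no guaranteed order.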

To get the values in order, you could design your own Writable type as a composite key (for example, the month plus the time), so that the ordering you want becomes part of the key sort. This other SO question explains how to do this.
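
As a rough illustration of the composite-key idea, here is a sketch in plain Java with no Hadoop dependencies. `SecondarySortSketch`, `MonthTimeKey`, `sameGroup`, and `shuffleSort` are hypothetical names, not taken from the question or from Hadoop. In a real job the key class would presumably implement `WritableComparable`, and the grouping and partitioning on the month alone would be registered on the `JobConf` with `setOutputValueGroupingComparator` and `setPartitionerClass`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SecondarySortSketch {

    /** Hypothetical composite key: the month groups records, the time orders them. */
    static final class MonthTimeKey implements Comparable<MonthTimeKey> {
        final int month;
        final long time;

        MonthTimeKey(int month, long time) {
            this.month = month;
            this.time = time;
        }

        // Sort order: month first, then time within the month.
        @Override
        public int compareTo(MonthTimeKey other) {
            int byMonth = Integer.compare(month, other.month);
            return byMonth != 0 ? byMonth : Long.compare(time, other.time);
        }

        // Grouping (and partitioning) would look at the month only, so that
        // all records of one month end up in the same reduce call.
        boolean sameGroup(MonthTimeKey other) {
            return month == other.month;
        }

        @Override
        public String toString() {
            return String.format("%02d@%d", month, time);
        }
    }

    /** Sorts a copy of the keys with the composite comparison, as a stand-in for the shuffle sort. */
    static List<MonthTimeKey> shuffleSort(List<MonthTimeKey> mapOutput) {
        List<MonthTimeKey> sorted = new ArrayList<>(mapOutput);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        List<MonthTimeKey> mapOutput = new ArrayList<>();
        mapOutput.add(new MonthTimeKey(2, 300L));
        mapOutput.add(new MonthTimeKey(1, 200L));
        mapOutput.add(new MonthTimeKey(2, 100L));
        mapOutput.add(new MonthTimeKey(1, 50L));

        for (MonthTimeKey key : shuffleSort(mapOutput)) {
            System.out.println(key);
        }
    }
}
```

The point of the design is that the framework only ever orders keys, so anything that must be ordered has to be moved into the key, while grouping keeps looking at the month alone.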

Finally, this other SO question gives more information on how the shuffle & sort phase works in Hadoop.
