Sofia
Sofia

Reputation: 456

Implementing range partitioning for skew data in Hadoop MapReduce

I have a data set with date value as key. However, my data is skew, meaning that the more recent the date the more the records with this date as a key. Thus, hash partitioning (default in Hadoop MR) is not appropriate, as it would uniformly distribute the keys, overloading specific reducers. That's why I've decided to write a custom partitioner. Any clues on how one can implement range partitioning in Hadoop MR, because my research so far has only lead to research papers.

Upvotes: 1

Views: 410

Answers (1)

OneCricketeer
OneCricketeer

Reputation: 191983

I don't think there's much that needs researched.

The class is org.apache.hadoop.mapreduce.Partitioner, and you can optionally implement org.apache.hadoop.conf.Configurable as a means of "passing parameters" to the Partitioner. For example, the BinaryPartitioner allows you to set left and right offsets within a byte array to partition on, as compared to the hash of the key. Depending on your data, that might even be good enough

Then, by extending the Partitioner class, you must implement the getPartition method to return an integer based on your own input data and logic. You're given the number of total partitions as a parameter, so don't worry about that.

Then, you just need to specify that your job use that Partitioner in the JobConf.

If you try to do this using Spark, Hive, Pig, etc. you need to ensure your class is on the YARN classpath of your job

Upvotes: 1

Related Questions