sai

Reputation: 97

Custom Dynamic Partitions in MapReduce

I am using MapReduce to process my data. I need the output to be stored under date partitions. My sort key is a date string. I override getPartition in my custom partitioner class to return the following:

return (formattedDate.hashCode() & Integer.MAX_VALUE) % numReduceTasks;

Because we are using hash and mod, in some cases different dates return the same integer value. For example, let's say numReduceTasks = 100.

The date 2018-01-20 might have a hash value of 101, so 101 % 100 = 1.

Now take another date, 2018-02-20, which might have a hash value of 201, so 201 % 100 = 1 as well. Because of this we end up with multiple dates going to a single date partition, which is not desired. Any pointers on how to handle this?
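For reference, a minimal sketch of the partitioner this describes (the class name and the assumption that the date string is the map output key are mine, not from the question):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner matching the return statement above. With
// numReduceTasks = 100, dates whose hash values differ by a multiple of 100
// (e.g. 101 and 201) map to the same reducer, so several dates end up in
// the same output partition.
public class DatePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        String formattedDate = key.toString(); // the date string, e.g. "2018-01-20"
        return (formattedDate.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}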

Upvotes: 0

Views: 319

Answers (2)

sai

Reputation: 97

MultipleOutputs is the solution that worked. It works for creating directories too. The Definitive Guide helped me with this.

The base path specified in the write() method of MultipleOutputs is interpreted relative to the output directory, and because it may contain file path separator characters (/), it’s possible to create subdirectories of arbitrary depth. For example, the following modification partitions the data by station and year so that each year’s data is contained in a directory named by the station ID (such as 029070-99999/1901/part-r-00000).
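As a rough sketch of that idea applied to date partitions (class, field, and type choices here are assumptions, not the book's exact code), the reducer can use the date key itself as the base path:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Illustrative reducer: the date key becomes the base path, so output for
// 2018-01-20 lands under <output-dir>/2018-01-20/part-r-nnnnn.
public class DateReducer extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> multipleOutputs;

    @Override
    protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text date, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // A base path containing "/" creates subdirectories under the output dir.
            multipleOutputs.write(date, value, date.toString() + "/part");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // MultipleOutputs must be closed, or the per-date files may be incomplete.
        multipleOutputs.close();
    }
}

If nothing is written to the default output, calling LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) in the driver avoids creating empty part-r files.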

Upvotes: 0

Arun A K

Reputation: 2225

I think in this case you should not explore using Partitioners and multiple reducers. Unless you know how many unique dates are in the dataset, you will not be able to set the number of reducers.

Use MultipleOutputs instead to send output to multiple files (files, not directories, though). If you need to send them to separate directories, you could add a step in your driver, after the MR job, that iterates over the output directory and moves files into directories based on the file name prefix, which in this case will be the date value; see the sketch below.
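A rough sketch of that post-MR step, assuming the MultipleOutputs files are named with the date as a prefix (the class name, paths, and pattern are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative post-processing: move each output file into a directory named
// after its date prefix, e.g. 2018-01-20-r-00000 -> 2018-01-20/2018-01-20-r-00000
public class MoveOutputsByDate {
    public static void main(String[] args) throws Exception {
        Path outputDir = new Path(args[0]);            // the job's output directory
        FileSystem fs = FileSystem.get(new Configuration());

        for (FileStatus status : fs.listStatus(outputDir)) {
            String name = status.getPath().getName();
            // Only touch files whose names start with a yyyy-MM-dd date; skip _SUCCESS etc.
            if (!status.isFile() || !name.matches("\\d{4}-\\d{2}-\\d{2}.*")) {
                continue;
            }
            String date = name.substring(0, 10);       // the yyyy-MM-dd prefix
            Path dateDir = new Path(outputDir, date);
            fs.mkdirs(dateDir);
            fs.rename(status.getPath(), new Path(dateDir, name));
        }
    }
}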

For an example using MO, see this.

Another option would be to run a normal MapReduce job, store the output in a regular output directory, create a Hive table on top of it, and use dynamic partitioning to send the output to different directories based on your date column; a rough sketch follows.
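For illustration, the Hive route could look roughly like this (table, column, and location names are made up):

-- Illustrative only: table, column, and location names are assumptions.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- External table over the plain MapReduce output (tab-separated here).
CREATE EXTERNAL TABLE staging (event_date STRING, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hadoop/mr-output';

-- Partitioned table: Hive writes one directory per distinct date.
CREATE TABLE events (payload STRING)
PARTITIONED BY (dt STRING);

-- Dynamic partitioning: the last selected column feeds the partition column.
INSERT OVERWRITE TABLE events PARTITION (dt)
SELECT payload, event_date FROM staging;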

Upvotes: 1
