How will Pig handle Skewed Joins?

Question

When joining datasets, you have the option of telling Pig that the keys might be skewed, as in the statement below.

... JOIN data1 BY my-join-key USING 'skewed' …

Pig will estimate the distribution of my-join-key values to see whether some values occur much more frequently than others. This adds some overhead (10% or so, but it depends on many factors).

How exactly is this information used in the MapReduce jobs? If there is skew, will Pig try to partition the keys so that they are more evenly balanced across reducers?

In this scenario, will Pig replicate the smaller dataset across mapper tasks, or will it just use more reducers?

rahulbmv · Accepted Answer

As per the documentation:

Skewed join does not place a restriction on the size of the input keys. It accomplishes this by splitting the left input on the join predicate and streaming the right input. The left input is sampled to create the histogram.

Skewed join can be used when the underlying data is sufficiently skewed and you need a finer control over the allocation of reducers to counteract the skew. It should also be used when the data associated with a given key is too large to fit in memory.

Pig spawns a mapper that parses the data and observes the key distribution; keys are then allocated to reducers based on that distribution.

Pig makes no attempt to replicate the smaller dataset to the mappers (I think you mean a replicated join here). Instead, the right side of the join is streamed to the reducers, with its records routed according to how the skew in the left side of the join was split.
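To make the idea concrete, here is a rough Python simulation of the behaviour described above. It is only a conceptual sketch, not Pig's actual implementation: the function names, the per-reducer "budget" heuristic, and the use of the full left input as the sample are all assumptions made for illustration. Keys that look hot in the histogram get several reducers, left rows for a hot key are spread round-robin over them, and right rows for that key are sent to every one of those reducers so each left row still finds its match.

```python
import math
import zlib
from collections import Counter, defaultdict


def allocate_hot_keys(sampled_keys, num_reducers):
    """Build a histogram and give hot keys more than one reducer (hypothetical heuristic)."""
    histogram = Counter(sampled_keys)
    # Assumed budget: the number of rows one reducer should ideally handle.
    budget = math.ceil(sum(histogram.values()) / num_reducers)
    allocation = {}
    start = 0
    for key, freq in histogram.most_common():
        needed = min(num_reducers, math.ceil(freq / budget))
        if needed > 1:  # only skewed keys get an explicit allocation
            allocation[key] = [(start + i) % num_reducers for i in range(needed)]
            start = (start + needed) % num_reducers
    return allocation


def default_reducer(key, num_reducers):
    # Stable hash, so the routing does not change between runs.
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def route_left(rows, allocation, num_reducers):
    """Split the left input: rows of a hot key go round-robin over its reducers."""
    buckets = defaultdict(list)
    seen = Counter()
    for key, value in rows:
        targets = allocation.get(key)
        if targets:
            reducer = targets[seen[key] % len(targets)]
            seen[key] += 1
        else:
            reducer = default_reducer(key, num_reducers)
        buckets[reducer].append((key, value))
    return buckets


def route_right(rows, allocation, num_reducers):
    """Stream the right input: rows of a hot key go to every reducer holding that key."""
    buckets = defaultdict(list)
    for key, value in rows:
        targets = allocation.get(key) or [default_reducer(key, num_reducers)]
        for reducer in targets:
            buckets[reducer].append((key, value))
    return buckets


def skewed_join(left, right, num_reducers):
    # Assumption: the whole left input stands in for the sample.
    allocation = allocate_hot_keys([key for key, _ in left], num_reducers)
    left_buckets = route_left(left, allocation, num_reducers)
    right_buckets = route_right(right, allocation, num_reducers)
    joined = []
    for reducer, left_rows in left_buckets.items():
        for left_key, left_value in left_rows:
            for right_key, right_value in right_buckets.get(reducer, []):
                if left_key == right_key:
                    joined.append((left_key, left_value, right_value))
    return allocation, left_buckets, joined


# Made-up data: one heavily skewed key ("hot") and two ordinary keys.
left = [("hot", i) for i in range(8)] + [("a", 100), ("b", 200)]
right = [("hot", "H"), ("a", "A"), ("b", "B")]
allocation, left_buckets, joined = skewed_join(left, right, num_reducers=4)
```

With this toy input the "hot" key is spread over three of the four reducers, while "a" and "b" fall back to ordinary hash partitioning. Nothing is shipped to the mappers, which is the difference from a replicated join.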
