pig - need tips after performance tuning gone wrong

Question

I have a Pig script that took around 10 minutes to finish and I thought that there was still room for some performance improvement.

So, I started by putting the JOINs and GROUPs in a nested FOREACH and also putting the previous FILTERs inside the same FOREACH.

I also added USING 'replicated' to the join.

The problem now is that instead of taking 10 minutes, it's taking over 30 minutes.

Is there a resource with best practices and performance tips, besides Pig's own documentation?

So that you can get a better picture, here's some code:

--before
previous_join = JOIN A by id, B by id; --for simplification
filtering = FILTER previous_join BY ((year_min > 1995 ? year_min - 1 : year_min) <= list_year and (year_max > 2015 ? year_max - 1 : year_max) >= list_year);

final_filtered = FOREACH filtering GENERATE user_id as user_id,  list_year;

--after
final_filtered = FOREACH (JOIN A by id, B by id) {
   tmp = FILTER group BY ((A::year_min > 1995 ? A::year_min - 1 : A::year_min) <= B::list_year and (A::year_max > 2015 ? A::year_max - 1 : A::year_max) >= B::list_year and A::premium == 'true');
   GENERATE A::user_id AS user_id, B::list_year AS list_year;
};

Am I doing something wrong or is this the wrong approach?

Thanks.

mbaxi · Accepted Answer

In the first version [before], you apply the filter and the projection after the join has been performed. It would help to record the time taken by each operation and identify which one is the bottleneck.

Can you also try splitting your filter statement into multiple relations rather than just one, and check the difference in filter timing?

filter_by_min_year = FILTER previous_join BY ((A::year_min > 1995 ? A::year_min - 1 : A::year_min) <= B::list_year);
filter_by_max_year = FILTER filter_by_min_year BY ((A::year_max > 2015 ? A::year_max - 1 : A::year_max) >= B::list_year);

Overall, you want to find ids (plus some more columns) with A::year_min <= B::list_year and A::year_max >= B::list_year. Instead of joining the raw A and B, you can try projecting both of them down to only the columns needed for the join and the later operations.

A_projected = FOREACH A GENERATE id, year_min, year_max;
B_projected = FOREACH B GENERATE id, list_year;
C = JOIN A_projected BY id, B_projected BY id USING 'replicated';

If either A_projected or B_projected is small enough to be loaded into memory, use a replicated join. In a replicated join the first relation is treated as the large one and the relations after it are loaded into memory, so I am assuming here that B_projected is smaller than A_projected. If this doesn't apply to your case, please skip this option.

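You can also try setting the number of reducers used for the join with the PARALLEL keyword. Note that PARALLEL only affects reduce-side operations, so it has no effect on a replicated join, which runs map-side.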
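As a rough sketch of that suggestion, PARALLEL can be attached to a default (reduce-side) join. The alias names and the value 20 are only illustrative assumptions; a sensible reducer count depends on your data volume and cluster.

```pig
-- A and B are assumed to be loaded earlier, as in the question.
A_projected = FOREACH A GENERATE id, year_min, year_max;
B_projected = FOREACH B GENERATE id, list_year;

-- Default join (no USING 'replicated'), so it has a reduce phase.
-- PARALLEL 20 asks for 20 reducers for this statement; 20 is an arbitrary example value.
C = JOIN A_projected BY id, B_projected BY id PARALLEL 20;
```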

After applying the filter you will have the list of required ids, which you can use to fetch other information from A or B.
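Putting these suggestions together, one possible shape for the script is sketched below. It is not a tested rewrite of your job: it assumes user_id and premium live in A (as the "after" snippet in the question suggests), that B is the smaller relation, and the alias names are made up for illustration. Here user_id is carried through the projection instead of being fetched afterwards.

```pig
-- A and B are assumed to be loaded earlier, as in the question.
-- Keep only the columns that the join, the filter and the output need.
A_small   = FOREACH A GENERATE id, user_id, year_min, year_max, premium;
-- Filter on A's own columns before the join so fewer rows take part in it.
A_premium = FILTER A_small BY premium == 'true';
B_small   = FOREACH B GENERATE id, list_year;

-- Replicated join: B_small (listed last) is assumed to fit in memory.
joined = JOIN A_premium BY id, B_small BY id USING 'replicated';

-- The year-range condition needs columns from both sides, so it stays after the join.
in_range = FILTER joined BY
    (A_premium::year_min > 1995 ? A_premium::year_min - 1 : A_premium::year_min) <= B_small::list_year
    AND (A_premium::year_max > 2015 ? A_premium::year_max - 1 : A_premium::year_max) >= B_small::list_year;

final_filtered = FOREACH in_range GENERATE A_premium::user_id AS user_id, B_small::list_year AS list_year;
```

If B is not small enough for memory, dropping USING 'replicated' falls back to the default join and the rest of the sketch stays the same.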

Also consider tweaking MapReduce properties such as io.sort.mb and mapred.job.shuffle.input.buffer.percent.
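These properties can be set from inside the Pig script with the SET command. The values below are placeholders rather than recommendations, and the property names are the older MapReduce ones mentioned above; newer Hadoop versions may use different names.

```pig
-- Placeholder values for illustration only; tune them against your own cluster.
SET io.sort.mb '200';
SET mapred.job.shuffle.input.buffer.percent '0.7';
```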

Hope this helps.
