Reputation: 81
I need tips, guidance, and/or your experience with improving the performance of Pig script execution on huge data-sets.
I am using Pig (version 0.12) and Hive (version 0.11) to analyze customer transactions. In my case, the Pig script will be scheduled to run daily. The main data-set contains approximately 500K to 800K records (sometimes up to 1 million), and there are 4 additional data-sets, of approximately 50K records each, that aid in the analysis of the main transaction data-set.
I have heard that in big data processing we should avoid JOINs, but in my case I can't: I have to join my main data-set with these 4 additional data-sets and apply lots of IF-ELSE logic, FILTERs, JOINs, etc. to generate a transaction analysis report on a daily basis.
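A simplified sketch of the kind of script I mean (all relation and field names below are placeholders, not my real schema):

    -- Placeholder names; the real script has more steps and lookups.
    txns   = LOAD 'daily_txns' AS (txn_id:chararray, cust_id:chararray,
                                   store_id:chararray, amount:double);
    stores = LOAD 'store_lookup' AS (store_id:chararray, region:chararray);

    -- One of the 4 lookup joins:
    j1 = JOIN txns BY store_id, stores BY store_id;

    -- FILTER plus IF-ELSE style logic (Pig's bincond operator):
    f1 = FILTER j1 BY amount > 0.0;
    r1 = FOREACH f1 GENERATE txns::txn_id,
                             (amount > 1000.0 ? 'high' : 'normal') AS tier;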
When I tried it on a main data-set of 95K records, it took approximately 2 hours, so I am worried about running it on the actual data-set of 1 million records.
How can I improve the performance of the Pig script? What is the best way to JOIN the data-sets?
Upvotes: 0
Views: 720
Reputation: 5801
It sounds like there is more going on in your data than you have mentioned. For example, you may have multiple instances of the JOIN key in both relations being joined (this would be my guess), or perhaps your data is highly skewed toward one particular key. For starters, check out this helpful chart guiding you through how to optimize your JOINs.
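If skew turns out to be the culprit, Pig's skewed join handles hot keys explicitly. A minimal sketch, reusing hypothetical relation names (daily_txns and customer_lookup are made up, not your schema):

    txns      = LOAD 'daily_txns'      AS (txn_id:chararray, cust_id:chararray, amount:double);
    customers = LOAD 'customer_lookup' AS (cust_id:chararray, segment:chararray);

    -- 'skewed' samples the key distribution first, then spreads a hot key
    -- across several reducers instead of piling it all onto one.
    joined = JOIN txns BY cust_id, customers BY cust_id USING 'skewed';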
Since your additional data sets have only 50K records each, they should probably be able to fit in memory unless each record is huge. If they do fit, you can use the USING 'replicated' clause to perform a map-side (fragment-replicated) join and avoid a reduce phase.
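A minimal sketch of a fragment-replicated join, again with hypothetical names:

    txns      = LOAD 'daily_txns'      AS (txn_id:chararray, cust_id:chararray, amount:double);
    customers = LOAD 'customer_lookup' AS (cust_id:chararray, segment:chararray);

    -- Every relation after the first is loaded into memory on the map side,
    -- so the join runs entirely in the map phase with no reduce step.
    -- The small relation(s) must be listed LAST in the JOIN statement.
    joined = JOIN txns BY cust_id, customers BY cust_id USING 'replicated';

You can list several small relations after the large one in a single replicated JOIN, as long as they all fit in memory together.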
Upvotes: 2