Reputation: 81
I need tips, guidance, and/or your experience with improving the performance of Pig script execution on huge data-sets.
I am using Pig (version 0.12) and Hive (version 0.11) to analyze customer transactions. In my case, the Pig script will be scheduled to run daily. The main data-set contains approximately 500K to 800K records (sometimes up to 1 million), and there are 4 additional data-sets, of approximately 50K records each, that aid in the analysis of the main transaction data-set.
I have heard that in big data processing we should avoid JOINs, but in my case I can't: I have to join my main data-set with these 4 additional data-sets and apply lots of IF-ELSE logic, FILTERs, JOINs, etc. to generate a transaction analysis report on a daily basis.
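A simplified sketch of the kind of script I mean (all relation and field names below are placeholders, not my real schema):

    -- Placeholder names; the real script has more steps and lookups.
    txns   = LOAD 'daily_txns' AS (txn_id:chararray, cust_id:chararray,
                                   store_id:chararray, amount:double);
    stores = LOAD 'store_lookup' AS (store_id:chararray, region:chararray);

    -- One of the 4 lookup joins:
    j1 = JOIN txns BY store_id, stores BY store_id;

    -- FILTER plus IF-ELSE style logic (Pig's bincond operator):
    f1 = FILTER j1 BY amount > 0.0;
    r1 = FOREACH f1 GENERATE txns::txn_id,
                             (amount > 1000.0 ? 'high' : 'normal') AS tier;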
When I tried it on a main data-set of 95K records, it took approximately 2 hours, so I am worried about running it on the actual data-set of 1 million records.
How can I improve the performance of the Pig script? What is the best way to JOIN the data-sets?
Upvotes: 0
Views: 720
Reputation: 5801
It sounds like there is more going on in your data than you have mentioned. For example, you may have multiple instances of the JOIN key in both relations being joined (this would be my guess), or perhaps your data is highly skewed toward one particular key. For starters, check out this helpful chart guiding you through how to optimize your JOINs.
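If skew turns out to be the culprit, Pig's skewed join handles hot keys explicitly. A minimal sketch, reusing hypothetical relation names (daily_txns and customer_lookup are made up, not your schema):

    txns      = LOAD 'daily_txns'      AS (txn_id:chararray, cust_id:chararray, amount:double);
    customers = LOAD 'customer_lookup' AS (cust_id:chararray, segment:chararray);

    -- 'skewed' samples the key distribution first, then spreads a hot key
    -- across several reducers instead of piling it all onto one.
    joined = JOIN txns BY cust_id, customers BY cust_id USING 'skewed';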
Since your additional data sets have only 50K records each, they should probably be able to fit in memory unless each record is huge. If they do fit, you can use the USING 'replicated' clause to perform a map-side (fragment-replicated) join and avoid a reduce phase.
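A minimal sketch of a fragment-replicated join, again with hypothetical names:

    txns      = LOAD 'daily_txns'      AS (txn_id:chararray, cust_id:chararray, amount:double);
    customers = LOAD 'customer_lookup' AS (cust_id:chararray, segment:chararray);

    -- Every relation after the first is loaded into memory on the map side,
    -- so the join runs entirely in the map phase with no reduce step.
    -- The small relation(s) must be listed LAST in the JOIN statement.
    joined = JOIN txns BY cust_id, customers BY cust_id USING 'replicated';

You can list several small relations after the large one in a single replicated JOIN, as long as they all fit in memory together.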
Upvotes: 2