Reputation: 691
I have a DataFrame with billions of rows with columns 'A', 'B', and others. The DataFrame is saved in Parquet format, partitioned by 'A'. If I run:
df.groupBy('A').agg(agg_functions)
It works, but if I run:
df.groupBy('B').agg(agg_functions)
the process fails with an out-of-memory error (it tries to bring all the data to a single executor). I know there is a relation between A and B: the same value of B can only appear in two consecutive partitions of A. Is there any way to use this fact to perform the operation efficiently?
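To make the setup concrete, here is roughly what I run (a minimal sketch; the path, the 'value' column, and the aggregation functions are placeholders):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Parquet data partitioned by 'A'; the path and the 'value' column are placeholders
df = spark.read.parquet("/path/to/table")
agg_functions = [F.sum("value").alias("total"), F.count("*").alias("n")]

df.groupBy("A").agg(*agg_functions)   # works: groups line up with the partitioning
df.groupBy("B").agg(*agg_functions)   # runs out of memory
```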
Upvotes: 0
Views: 437
Reputation: 717
One approach could be to group by two columns and aggregate twice. For example, suppose I had three columns (country, city, and orders), where your column 'A' corresponds to country and 'B' corresponds to city. If I wanted to count all the orders grouped by country, I could do:
df.groupBy("country").agg(count)
Some groups can be huge, as in your case, so instead I could do something like:
intermediateResults = df.groupBy("country", "city").agg(F.count("orders").alias("cnt"))
intermediateResults.groupBy("country").agg(F.sum("cnt"))
You might not always have such a relation between columns, in which case you could split the aggregation up across time ranges or IDs.
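Putting it together, here is a minimal, self-contained sketch of the two-step aggregation (the sample rows are made up for illustration; count and sum are used because their partial results can be recombined):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Tiny demo data: one row per order (values are made up)
df = spark.createDataFrame(
    [("US", "NYC"), ("US", "NYC"), ("US", "LA"), ("FR", "Paris")],
    ["country", "city"],
)

# One-step aggregation: a single country can form a very large group
direct = df.groupBy("country").agg(F.count("*").alias("orders"))

# Two-step aggregation: pre-aggregate on (country, city), then combine.
# This works because a count can be recombined by summing the partial counts;
# it does not hold for aggregations such as distinct counts.
intermediate = df.groupBy("country", "city").agg(F.count("*").alias("cnt"))
twoStep = intermediate.groupBy("country").agg(F.sum("cnt").alias("orders"))

direct.show()
twoStep.show()   # same totals per country
```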
Upvotes: 1