Reputation: 691
I have a DataFrame with billions of rows with columns 'A', 'B', and others. The DataFrame is saved in Parquet format, partitioned by 'A'. If I run:
df.groupBy('A').agg(agg_functions)
It works, but if I run:
df.groupBy('B').agg(agg_functions)
the process fails with an out-of-memory error (it tries to bring all the data to a single executor). I know there is a relation between A and B: the same value of B can only appear in two consecutive partitions of A. Is there any way to use this fact to perform the operation efficiently?
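To make the setup concrete, here is roughly what I run (a minimal sketch; the path, the 'value' column, and the aggregation functions are placeholders):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Parquet data partitioned by 'A'; the path and the 'value' column are placeholders
df = spark.read.parquet("/path/to/table")
agg_functions = [F.sum("value").alias("total"), F.count("*").alias("n")]

df.groupBy("A").agg(*agg_functions)   # works: groups line up with the partitioning
df.groupBy("B").agg(*agg_functions)   # runs out of memory
```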
Upvotes: 0
Views: 437
Reputation: 717
One approach could be to group by two columns and aggregate twice. For example, suppose I had three columns (country, city, and orders), where your column 'A' corresponds to country and 'B' corresponds to city. If I wanted to count all the orders grouped by country, I could do:
df.groupBy("country").agg(count)
Some groups can be huge, as in your case, so instead I could do something like:
intermediateResults = df.groupBy("country", "city").agg(F.count("orders").alias("cnt"))
intermediateResults.groupBy("country").agg(F.sum("cnt"))
You might not always have such a relation between columns, in which case you could split the aggregation up across time ranges or IDs.
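Putting it together, here is a minimal, self-contained sketch of the two-step aggregation (the sample rows are made up for illustration; count and sum are used because their partial results can be recombined):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Tiny demo data: one row per order (values are made up)
df = spark.createDataFrame(
    [("US", "NYC"), ("US", "NYC"), ("US", "LA"), ("FR", "Paris")],
    ["country", "city"],
)

# One-step aggregation: a single country can form a very large group
direct = df.groupBy("country").agg(F.count("*").alias("orders"))

# Two-step aggregation: pre-aggregate on (country, city), then combine.
# This works because a count can be recombined by summing the partial counts;
# it does not hold for aggregations such as distinct counts.
intermediate = df.groupBy("country", "city").agg(F.count("*").alias("cnt"))
twoStep = intermediate.groupBy("country").agg(F.sum("cnt").alias("orders"))

direct.show()
twoStep.show()   # same totals per country
```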
Upvotes: 1