yuvraj rajpurohit

Reputation: 69

Aggregate on multiple columns in spark dataframe (all combination)

I want to get a count of customers based on every combination of the columns I have in a DataFrame.

For example, suppose I have a DataFrame with 5 columns:

id, col1, col2, col3, cust_id

I need the customer count for every combination:

    id, col1, count(cust_id)
    id, col1, col2, count(cust_id)
    id, col1, col3, count(cust_id)
    id, col1, col2, col3, count(cust_id)
    id, col2, count(cust_id)
    id, col2, col3, count(cust_id)

And so on for all the remaining combinations.

It is very tedious to do this separately, passing each different combination to the DataFrame's groupBy function and then aggregating the customer count.

Is there any way to achieve this and then combine all the results into one DataFrame, so that we can write everything to a single output file?
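For reference, the brute-force approach I am trying to avoid looks roughly like this (just a sketch; the combinations listed are hand-picked, not exhaustive):

    import org.apache.spark.sql.functions.count

    val combos = Seq(
      Seq("id", "col1"),
      Seq("id", "col1", "col2"),
      Seq("id", "col2", "col3")
      // ...one entry per combination, which gets unwieldy fast
    )

    val results = combos.map { cols =>
      df.groupBy(cols.head, cols.tail: _*).agg(count("cust_id"))
    }
    // Each result has a different schema, so even unioning them
    // into one DataFrame for a single output file is awkward.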

It looks a bit complex to me; I would really appreciate it if anyone could provide a solution. Please let me know if more details are required.

Thanks a lot.

Upvotes: 0

Views: 1224

Answers (1)

Alper t. Turker

Reputation: 35229

It is possible, and it is called cube:

    import org.apache.spark.sql.functions.count

    df.cube("id", "col1", "col2", "col3").agg(count("cust_id"))
      .na.drop(minNonNulls = 3)  // to exclude some combinations
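To see why the .na.drop trick works, here is a minimal sketch (assuming a SparkSession named spark; the toy data is made up): cube emits one row per grouping combination, with null in every column that was aggregated away.

    import spark.implicits._
    import org.apache.spark.sql.functions.count

    val df = Seq(
      (1, "a", "x", "p", 100),
      (1, "a", "x", "q", 101),
      (2, "b", "y", "p", 102)
    ).toDF("id", "col1", "col2", "col3", "cust_id")

    df.cube("id", "col1", "col2", "col3")
      .agg(count("cust_id").as("cust_count"))
      .show(false)
    // Rows where, say, col2 and col3 are null represent the (id, col1)
    // grouping. Since cust_count itself is never null, minNonNulls = 3
    // keeps only rows with at least two grouping columns filled in.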

The SQL version also provides GROUPING SETS, which can be more efficient than .na.drop.
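A rough sketch of the GROUPING SETS variant (the view name t and the enumerated sets below are illustrative, not exhaustive):

    df.createOrReplaceTempView("t")

    spark.sql("""
      SELECT id, col1, col2, col3, count(cust_id) AS cust_count
      FROM t
      GROUP BY id, col1, col2, col3
      GROUPING SETS ((id, col1), (id, col1, col2), (id, col2, col3))
    """).show(false)
    // Only the enumerated grouping sets are computed, so unwanted
    // combinations never materialize and no na.drop pass is needed.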

Upvotes: 1
