Reputation: 69
I want to get count of customer based on all combination of columns i have in dataframe.
For eg: - Suppose if I have dataframe with 5 columns.
id, col1, col2, col3, cust_id
I need the count of customer for all combination:
id, col1, count(cust_id)
id, col1, col2, count(cust_id)
id, col1, col3, count(cust_id)
id, col1, col2, col3, count(cust_id)
id, col2, count(cust_id)
id, col2, col3, count(cust_id)
And so on for all permutation and combination.
Its very difficult to do it separately providing all different combination to groupBy function of data frame and then aggregate the count of customer.
Is there any way we can achieve this and then combine all the result to add it in one data frame to that we can write the result in one output file.
to me it looks bit complex, really appreciate if any one can provide any solution. please let me know if any more details are required.
Thanks a lot.
Upvotes: 0
Views: 1224
Reputation: 35229
It is possible and it is called cube
:
df.cube("id", "col1", "col2", "col3").agg(count("cust_id"))
.na.drop(minNonNulls=3) // To exclude some combinations
SQL version also provides a GROUPING SET
which can be more efficient than .na.drop
.
Upvotes: 1