Reputation: 101
Suppose there is a dataset with some number of rows.
I need to find out the Heterogeneity i.e.
distinct number of rows divide by total number of rows.
Please help me with spark query to execute the same.
Upvotes: 1
Views: 72
Reputation: 6974
Dataset and dataframe supports distinct function which finds distinct rows in the dataset.
So essentially you need to do
val heterogeneity = dataset.distinct.count / dataset.count
Only thing is if the dataset is big the distinct could be expensive and you might need to set the spark shuffle partition correctly.
Upvotes: 1