Reputation: 101

Divide operation in Spark using RDD or DataFrame

Suppose there is a dataset with some number of rows.

I need to compute its heterogeneity, i.e. the number of distinct rows divided by the total number of rows.

Please help me with a Spark query that computes this.

Upvotes: 1

Answers (1)

Reputation: 6974

Both Dataset and DataFrame support the distinct function, which returns the distinct rows of the dataset.

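So essentially you need to do the following. Note that count returns a Long, so convert one operand to Double; otherwise integer division truncates the result to 0 whenever the dataset contains duplicates.

val heterogeneity = dataset.distinct().count().toDouble / dataset.count()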

The only caveat is that on a large dataset distinct requires a shuffle and can be expensive, so you may need to tune the number of shuffle partitions (spark.sql.shuffle.partitions for DataFrames and Datasets).
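For reference, here is a minimal self-contained sketch of the whole computation. The sample rows, the column names, the local master, and the partition count of 8 are all assumptions made for illustration; pick values that match your own data and cluster.

```scala
import org.apache.spark.sql.SparkSession

object HeterogeneityExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; on a cluster the master is usually set by spark-submit.
    val spark = SparkSession.builder()
      .appName("heterogeneity-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // distinct() shuffles the data; 8 partitions is an arbitrary value for this tiny example.
    spark.conf.set("spark.sql.shuffle.partitions", "8")

    // Hypothetical sample data: 4 rows, 3 of them distinct.
    // cache() avoids recomputing the input for the two count actions below.
    val dataset = Seq(("a", 1), ("a", 1), ("b", 2), ("c", 3))
      .toDF("key", "value")
      .cache()

    val total = dataset.count()
    val distinctRows = dataset.distinct().count()

    // Convert to Double before dividing, and guard against an empty dataset.
    val heterogeneity = if (total == 0) 0.0 else distinctRows.toDouble / total

    println(s"heterogeneity = $heterogeneity") // 3 distinct rows out of 4 gives 0.75

    spark.stop()
  }
}
```

Caching is optional here; it only pays off because the same dataset is scanned twice, and it assumes the data fits in the available memory or can spill to disk.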

Upvotes: 1