Reputation: 2627
I'd like to check the distinct values in a data frame, and I know there are a few ways I can do it. I'd like to look at the unique values for the columns rabbit, platypus and book.
This is the first way
mydf
.select("rabbit", "platypus", "book")
.distinct
.show
This is the second way
mydf
.select("rabbit", "platypus", "book")
.distinct
.count
This is another way
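// Assumes mydf is a typed Dataset whose element type has rabbit, platypus and book fields
val rabbit = mydf.groupByKey(log => {
  val rabbit = log.rabbit  // key each record by its own rabbit value
  rabbit
}).count.collect
val platypus = mydf.groupByKey(log => {
  val platypus = log.platypus
  platypus
}).count.collect
val book = mydf.groupByKey(log => {
  val book = log.book
  book
}).count.collect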
Upvotes: 1
Views: 848
Reputation: 31530
.collect brings all of the results back to the driver and can cause OOM errors on big datasets.
Use the .distinct() method instead, and if you want the count of distinct records, use df.distinct().count().
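To make the suggestion concrete, here is a minimal sketch of how the distinct checks could be done without collecting everything to the driver. The object name DistinctSketch and its helper methods are hypothetical, introduced only for illustration; the column names are the ones from the question, and the Spark calls used (select, distinct, count, groupBy, agg, countDistinct) are the standard DataFrame API.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, countDistinct}

// Hypothetical helpers, not part of Spark itself.
object DistinctSketch {
  // Column names taken from the question.
  val cols: Seq[String] = Seq("rabbit", "platypus", "book")

  // Number of distinct (rabbit, platypus, book) combinations.
  // Only a single Long travels back to the driver.
  def distinctRowCount(df: DataFrame): Long =
    df.select(cols.map(col): _*).distinct().count()

  // Distinct count for each column separately, in one aggregation.
  // The result is one row with one Long per column.
  def perColumnDistinct(df: DataFrame): Map[String, Long] = {
    val aggs = cols.map(c => countDistinct(col(c)).as(c))
    val row = df.agg(aggs.head, aggs.tail: _*).head()
    cols.map(c => c -> row.getAs[Long](c)).toMap
  }

  // Occurrences of each value in one column, returned as a DataFrame
  // so the result stays distributed until the caller decides how much to fetch.
  def valueCounts(df: DataFrame, column: String): DataFrame =
    df.groupBy(column).count()
}
```

To eyeball the values, something like DistinctSketch.valueCounts(mydf, "rabbit").show(20) prints only the first 20 rows instead of pulling the whole result with .collect. If a column is known to have few distinct values, collecting the grouped counts is usually fine; the OOM risk comes from high-cardinality columns.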
Upvotes: 2