schoon

Reputation: 3324

How do I use countDistinct in Spark/Scala?

I am trying to aggregate a column in a Spark dataframe using Scala, like so:

import org.apache.spark.sql._

dfNew.agg(countDistinct("filtered"))

but I get the error:

 error: value agg is not a member of Unit

Can anyone explain why?

EDIT: to clarify what I am trying to do: I have a column which is a string array, and I want to count the distinct elements over all the rows; I am not interested in any other columns. Data:

+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|racist|filtered                                                                                                                                                      |
+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|false |[rt, @dope_promo:, crew, beat, high, scores, fugly, frog, 😍🔥, https://time.com/sxp3onz1w8]                                                                      |
|false |[rt, @axolrose:, yall, call, kermit, frog, lizard?, , https://time.com/wdaeaer1ay]                                                                                |

And I want to count filtered, giving:

rt:2, @dope_promo:1, crew:1, ...frog:2 etc

Upvotes: 3

Views: 9724

Answers (1)

Raphael Roth

Reputation: 27373

You need to explode the array first before you can count occurrences. (As for the error in the question: `value agg is not a member of Unit` suggests that `dfNew` itself has type `Unit`, which typically happens when it was assigned from a side-effecting call such as `.show()`, so make sure `dfNew` is actually a DataFrame.) To view the counts of each element:

import org.apache.spark.sql.functions.explode
import spark.implicits._  // for the $"..." column syntax

dfNew
  .withColumn("filtered", explode($"filtered"))
  .groupBy($"filtered")
  .count
  .orderBy($"count".desc)
  .show
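The explode/groupBy/count pipeline is just "flatten all the arrays into one token stream, then tally identical tokens". As a sanity check, here is the same logic sketched on plain Scala collections, using abbreviated versions of the two sample rows from the question (the token lists are shortened for illustration):

```scala
// Abbreviated "filtered" arrays from the question's sample data
val rows = Seq(
  Seq("rt", "@dope_promo:", "crew", "frog"),
  Seq("rt", "@axolrose:", "yall", "frog")
)

// flatten = what explode does; groupBy + size = what groupBy + count do
val counts: Map[String, Int] =
  rows.flatten.groupBy(identity).map { case (token, occurrences) =>
    token -> occurrences.size
  }

println(counts("rt"))   // 2
println(counts("frog")) // 2
```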

or just to get the count of distinct elements:

val count = dfNew
  .withColumn("filtered", explode($"filtered"))
  .select($"filtered")
  .distinct
  .count
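The second snippet corresponds to flattening, deduplicating, and counting. A plain-Scala sketch of the same idea, again on the abbreviated sample rows from the question:

```scala
// Abbreviated "filtered" arrays from the question's sample data
val rows = Seq(
  Seq("rt", "@dope_promo:", "crew", "frog"),
  Seq("rt", "@axolrose:", "yall", "frog")
)

// flatten = explode, distinct = distinct, size = count
val distinctCount = rows.flatten.distinct.size
println(distinctCount) // 6 (rt and frog each appear twice but count once)
```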

Upvotes: 2
