Reputation: 3324
I am trying to aggregate a column in a Spark DataFrame using Scala, like so:
import org.apache.spark.sql._
dfNew.agg(countDistinct("filtered"))
but I get the error:
error: value agg is not a member of Unit
Can anyone explain why?
EDIT: To clarify what I am trying to do: I have a column which is a string array, and I want to count the distinct elements over all the rows; I am not interested in any other columns. Data:
+------+--------------------------------------------------------------------------------------------+
|racist|filtered                                                                                    |
+------+--------------------------------------------------------------------------------------------+
|false |[rt, @dope_promo:, crew, beat, high, scores, fugly, frog, 😍🔥, https://time.com/sxp3onz1w8]  |
|false |[rt, @axolrose:, yall, call, kermit, frog, lizard?, , https://time.com/wdaeaer1ay]          |
+------+--------------------------------------------------------------------------------------------+
And I want to count each element of filtered across all rows, giving:
rt:2, @dope_promo:1, crew:1, ...frog:2 etc
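For reference, a minimal sketch that reconstructs this sample DataFrame in a Spark shell (the spark session value and the exact rows are assumed from the output above):

import spark.implicits._

// hypothetical reconstruction of the sample data shown above
val dfNew = Seq(
  (false, Seq("rt", "@dope_promo:", "crew", "beat", "high", "scores", "fugly", "frog", "😍🔥", "https://time.com/sxp3onz1w8")),
  (false, Seq("rt", "@axolrose:", "yall", "call", "kermit", "frog", "lizard?", "", "https://time.com/wdaeaer1ay"))
).toDF("racist", "filtered")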
Upvotes: 3
Views: 9724
Reputation: 27373
You need to explode your array first before you can count occurrences. To view the counts of each element:
import org.apache.spark.sql.functions.explode
import spark.implicits._ // for the $"..." column syntax

dfNew
  .withColumn("filtered", explode($"filtered")) // one row per array element
  .groupBy($"filtered")
  .count
  .orderBy($"count".desc)
  .show
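With the sample rows above, this should print something like (exact ordering of the count-1 ties may vary):

+--------+-----+
|filtered|count|
+--------+-----+
|      rt|    2|
|    frog|    2|
|    crew|    1|
|     ...|  ...|
+--------+-----+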
Or, to get just the count of distinct elements:
val count = dfNew
  .withColumn("filtered", explode($"filtered"))
  .select($"filtered")
  .distinct // keep each element once
  .count    // returns the number of distinct elements as a Long
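And if you want to keep the agg/countDistinct style from the question, the same distinct count can be computed after exploding (a sketch, assuming the imports above):

import org.apache.spark.sql.functions.countDistinct

val count = dfNew
  .select(explode($"filtered").as("filtered"))
  .agg(countDistinct($"filtered"))
  .first
  .getLong(0)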
Upvotes: 2