Reputation: 41
I am trying to collect the distinct values of a Spark DataFrame column into a list using Scala. I have tried different options:
df.select(column_name).distinct().rdd.map(r => r(0).toString).collect().toList
df.groupBy(col(column_name)).agg(collect_list(col(column_name))).rdd.map(r => r(0).toString).collect().toList
Both of them work, but given the volume of my data the process is pretty slow, so I am trying to speed things up. Does anyone have a suggestion I could try?
I am using Spark 2.1.1
thanks!
Upvotes: 0
Views: 3619
Reputation: 41957
You can try
df.select("colName").dropDuplicates().rdd.map(row =>row(0)).collect.toList
Or you can try
df.select("colName").dropDuplicates().withColumn("colName", collect_list("colName")).rdd.map(row =>row(0)).collect
Upvotes: 1