Reputation: 41
I am trying to collect the distinct values of a Spark DataFrame column into a list using Scala. I have tried different options:
df.select(column_name).distinct().rdd.map(r => r(0).toString).collect().toList
df.groupBy(col(column_name)).agg(collect_list(col(column_name))).rdd.map(r => r(0).toString).collect().toList
Both of them work, but given the volume of my data the process is pretty slow, so I am trying to speed things up. Does anyone have a suggestion I could try?
I am using Spark 2.1.1
thanks!
Upvotes: 0
Views: 3619
Reputation: 41957
You can try
df.select("colName").dropDuplicates().rdd.map(row =>row(0)).collect.toList
Or you can try
df.select("colName").dropDuplicates().withColumn("colName", collect_list("colName")).rdd.map(row =>row(0)).collect
Upvotes: 1