vrivesmolina

Reputation: 41

Fast way to collect Spark DataFrame column values into a list (Scala)

I am trying to collect the distinct values of a Spark DataFrame column into a list using Scala. I have tried a couple of different options, and they both work, but for the volume of my data the process is pretty slow, so I am trying to speed things up. Does anyone have a suggestion I could try?

I am using Spark 2.1.1.

Thanks!

Upvotes: 0

Views: 3619

Answers (1)

Ramesh Maharjan

Reputation: 41957

You can try

df.select("colName").dropDuplicates().rdd.map(row => row(0)).collect.toList
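For anyone who wants to run this end to end, here is a minimal sketch of the same idea that stays in the Dataset API instead of dropping to the RDD. It assumes the column holds strings; the object name, the helper `distinctValues`, the local session, and the sample data are all placeholders for illustration, not part of the original question.

```scala
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}

object DistinctToList {
  // Hypothetical helper: deduplicates on the executors, then collects
  // only the distinct values to the driver. Assumes a string column.
  def distinctValues(df: DataFrame, colName: String): List[String] =
    df.select(colName).distinct().as[String](Encoders.STRING).collect().toList

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would reuse its own session.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("distinct-to-list")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for the real DataFrame.
    val df = Seq("a", "b", "a", "c").toDF("colName")

    println(distinctValues(df, "colName").sorted)
    spark.stop()
  }
}
```

Whichever variant is used, every distinct value ends up in driver memory, so it is worth timing the alternatives on the real data rather than assuming one is faster.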

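Or you can try the following, which needs import org.apache.spark.sql.functions.collect_list. Since collect_list is an aggregate function, the result should be a single row, so the collected array has one element that holds all the values:

df.select("colName").dropDuplicates().withColumn("colName", collect_list("colName")).rdd.map(row => row(0)).collect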
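The aggregate route can also be sketched with `collect_set`, which deduplicates and gathers the values in one aggregation and returns them as a flat list. This is a sketch under assumptions, not a guaranteed speed-up: it assumes a string column, the object name and the helper `distinctViaAgg` are hypothetical, and `collect_set` skips nulls, so a null value in the column will not appear in the result.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.collect_set

object CollectSetToList {
  // Hypothetical helper: aggregates the column into one array-valued row,
  // then unwraps that array on the driver. Assumes a string column.
  def distinctViaAgg(df: DataFrame, colName: String): List[String] =
    df.agg(collect_set(colName)).first().getSeq[String](0).toList

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("collect-set-to-list")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for the real DataFrame.
    val df = Seq("a", "b", "a", "c").toDF("colName")

    // collect_set does not guarantee an order, so sort before printing.
    println(distinctViaAgg(df, "colName").sorted)
    spark.stop()
  }
}
```

Because the whole set is built inside a single row, this only makes sense when the number of distinct values is small enough to fit comfortably in memory.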

Upvotes: 1
