Reputation: 1232
I'm trying to get the distinct values of a column in a PySpark DataFrame and save them in a list. At the moment the list contains "Row(no_children=0)", but I need only the values, as I will use them in another part of my code.
So, ideally only all_values=[0,1,2,3,4]
all_values=sorted(list(df1.select('no_children').distinct().collect()))
all_values
[Row(no_children=0),
Row(no_children=1),
Row(no_children=2),
Row(no_children=3),
Row(no_children=4)]
This takes around 15 seconds to run. Is that normal?
Thank you very much!
Upvotes: 10
Views: 20187
Reputation: 2897
Try this:
all_values = df1.select('no_children').distinct().rdd.flatMap(list).collect()
Upvotes: 3
Reputation: 5870
You can use collect_set from the functions module to get a column's distinct values. For example:
>>> from pyspark.sql import functions as F
>>> df1.show()
+-----------+
|no_children|
+-----------+
|          0|
|          3|
|          2|
|          4|
|          1|
|          4|
+-----------+
>>> df1.select(F.collect_set('no_children').alias('no_children')).first()['no_children']
[0, 1, 2, 3, 4]
Upvotes: 17
Reputation: 560
You could do something like this to get only the values:
values = [r.no_children for r in all_values]
values
[0, 1, 2, 3, 4]
Upvotes: 1