Reputation: 181
I have multiple columns from which I want to collect the distinct values. I can do it this way:
for c in columns:
    values = dataframe.select(c).distinct().collect()
But this takes a lot of time. Is there a way of doing it for all columns at the same time?
Upvotes: 2
Views: 1675
Reputation: 10362
Use the collect_set function to collect the distinct values of every column in a single pass.
Sample Data in DataFrame
>>> df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 2| 2| 4|
| 1| 3| 2|
| 4| 3| 3|
+---+---+---+
Import the required functions:
>>> from pyspark.sql.functions import collect_set
>>> from pyspark.sql.functions import col
Build a collect_set expression for each column. Use a list comprehension rather than map here: in Python 3, map returns a one-shot iterator, so columnExprs would be exhausted after the first select and silently empty on the second use below.
>>> columnExprs = [collect_set(col(c)).alias(c) for c in df.columns]
Apply columnExprs in select:
>>> df.select(*columnExprs).show()
+---------+------+---------+
| a| b| c|
+---------+------+---------+
|[1, 2, 4]|[2, 3]|[2, 3, 4]|
+---------+------+---------+
Use the collect function to collect the result:
>>> df.select(*columnExprs).collect()
[Row(a=[1, 2, 4], b=[2, 3], c=[2, 3, 4])]
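For readers without a Spark session handy, the semantics of the single-pass approach can be sketched in plain Python: collect_set effectively builds one set per column during a single pass over the rows, instead of one full scan per column. This is only an illustrative stand-in using the sample data above, not Spark itself.

```python
# Plain-Python sketch of per-column distinct collection in one pass
# (illustration of the collect_set semantics, not actual Spark code).
rows = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 2, "b": 2, "c": 4},
    {"a": 1, "b": 3, "c": 2},
    {"a": 4, "b": 3, "c": 3},
]

# One set per column, filled in a single pass over the rows.
distinct = {column: set() for column in ("a", "b", "c")}
for row in rows:
    for column, value in row.items():
        distinct[column].add(value)

print(distinct)  # → {'a': {1, 2, 4}, 'b': {2, 3}, 'c': {2, 3, 4}}
```

The looping question code, by contrast, corresponds to scanning `rows` once per column, which is why it is so much slower on a real DataFrame.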
Upvotes: 2