Dusty

Reputation: 181

Get distinct values of multiple columns

I have multiple columns from which I want to collect the distinct values. I can do it this way:

values = {}
for c in columns:
    # one distinct() + collect() per column; keep each result instead of overwriting it
    values[c] = dataframe.select(c).distinct().collect()

But this takes a lot of time. Is there a way of doing it for all columns at the same time?

Upvotes: 2

Views: 1675

Answers (1)

s.polam

Reputation: 10362

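Use the collect_set aggregate function, which gathers the distinct values of a column into an array. Applying it to every column in one select gets all the distinct values at once.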

Sample Data in DataFrame

>>> df.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
|  2|  2|  4|
|  1|  3|  2|
|  4|  3|  3|
+---+---+---+

Import required functions

>>> from pyspark.sql.functions import collect_set
>>> from pyspark.sql.functions import col

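Build one collect_set expression per column. A list is used rather than map(), because in Python 3 map() returns a one-shot iterator that would be empty on the second use below.

>>> columnExprs = [collect_set(col(c)).alias(c) for c in df.columns]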

Apply columnExprs in select

>>> df.select(*columnExprs).show()
+---------+------+---------+
|        a|     b|        c|
+---------+------+---------+
|[1, 2, 4]|[2, 3]|[2, 3, 4]|
+---------+------+---------+

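Use collect() to bring the result back to the driver as a single Row. Note that collect_set skips nulls and does not guarantee the order of the elements.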

>>> df.select(*columnExprs).collect()
[Row(a=[1, 2, 4], b=[2, 3], c=[2, 3, 4])]
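
The steps above can be wrapped into one helper. This is only a minimal sketch, not part of the original answer: the names distinct_values and row_to_sorted_lists are hypothetical, and it assumes the values within each column are mutually comparable so they can be sorted (collect_set itself does not promise any element order). It also assumes the distinct values of every column fit comfortably in driver memory.

```python
def row_to_sorted_lists(row_dict):
    """Sort each column's distinct values so the result is reproducible.

    Assumption: values within a column are mutually comparable.
    """
    return {name: sorted(values) for name, values in row_dict.items()}


def distinct_values(df):
    """Return {column name: sorted distinct values} using one select."""
    # Imported lazily so the pure-Python helper above works without Spark.
    from pyspark.sql.functions import col, collect_set

    # A list (not map()) so the expressions could be reused in Python 3.
    exprs = [collect_set(col(c)).alias(c) for c in df.columns]

    # The global aggregation produces exactly one row.
    row = df.select(*exprs).first()
    return row_to_sorted_lists(row.asDict())
```

With the sample DataFrame above, distinct_values(df) should return {'a': [1, 2, 4], 'b': [2, 3], 'c': [2, 3, 4]}.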

Upvotes: 2
