exAres

Reputation: 4926

How to get unique values for each column in HIVE/PySpark table?

I have a table in HIVE/PySpark with columns A, B and C. I want to get the unique values of each column, like

{A: [1, 2, 3], B:[a, b], C:[10, 20]}

in any format (dataframe, table, etc.)

How to do this efficiently (in parallel for each column) in HIVE or PySpark?

My current approach does this for each column separately, so it takes a lot of time.

Upvotes: 0

Views: 283

Answers (1)

Suresh

Reputation: 5870

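We can use collect_set() from the pyspark.sql.functions module, applied to every column in a single select. Note that the order of the elements in each resulting array is not guaranteed: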

>>> df = spark.createDataFrame([(1,'a',10),(2,'a',20),(3,'b',10)],['A','B','C'])
>>> df.show()
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  a| 10|
|  2|  a| 20|
|  3|  b| 10|
+---+---+---+

>>> from pyspark.sql import functions as F
>>> df.select([F.collect_set(x).alias(x) for x in df.columns]).show()
+---------+------+--------+
|        A|     B|       C|
+---------+------+--------+
|[1, 2, 3]|[b, a]|[20, 10]|
+---------+------+--------+
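
Since the question also asks about HIVE: collect_set is available as a Hive aggregate function too, so the same idea can be written as one query. Below is a minimal sketch in plain Python that only builds the query string from a list of column names. The helper name build_distinct_values_query and the table name my_table are placeholders made up for illustration, not anything from the question.

```python
def build_distinct_values_query(table, columns):
    """Build one HiveQL query that collects the distinct values of every column."""
    # Backticks quote identifiers in HiveQL; collect_set drops duplicate values.
    select_list = ", ".join(
        "collect_set(`{0}`) AS `{0}`".format(col) for col in columns
    )
    return "SELECT {} FROM {}".format(select_list, table)


# 'my_table' is a hypothetical table name; replace it with the real one.
query = build_distinct_values_query("my_table", ["A", "B", "C"])
print(query)
```

The resulting string can be run in Hive directly, or from PySpark with spark.sql(query), assuming the table is visible to the Spark session. As with the DataFrame version, the order of values inside each array is not guaranteed.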

Upvotes: 4
