Reputation: 4926
I have a table in Hive/PySpark with columns A, B and C. I want to get the unique values of each column, e.g.
{A: [1, 2, 3], B: [a, b], C: [10, 20]}
in any format (dataframe, table, etc.)
How can I do this efficiently (in parallel for each column) in Hive or PySpark?
My current approach processes each column separately and is therefore taking a lot of time.
Upvotes: 0
Views: 283
Reputation: 5870
We can use collect_set() from the pyspark.sql.functions module:
>>> df = spark.createDataFrame([(1,'a',10),(2,'a',20),(3,'b',10)],['A','B','C'])
>>> df.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| a| 10|
| 2| a| 20|
| 3| b| 10|
+---+---+---+
>>> from pyspark.sql import functions as F
>>> df.select([F.collect_set(x).alias(x) for x in df.columns]).show()
+---------+------+--------+
| A| B| C|
+---------+------+--------+
|[1, 2, 3]|[b, a]|[20, 10]|
+---------+------+--------+
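If you want the exact dict shape from the question, here is a plain-Python sketch of what this single-pass aggregation computes: each column's distinct values are gathered in one traversal of the rows, analogous to the `collect_set` call above (the sample data is the same as in the answer; note that `collect_set` returns values in nondeterministic order, so the sets are sorted here only for readability):

```python
# Sample rows and column names, matching the example DataFrame above.
rows = [(1, 'a', 10), (2, 'a', 20), (3, 'b', 10)]
columns = ['A', 'B', 'C']

# One pass over the data, accumulating distinct values per column.
uniques = {c: set() for c in columns}
for row in rows:
    for c, value in zip(columns, row):
        uniques[c].add(value)

# Sort only for a stable, readable result.
result = {c: sorted(uniques[c]) for c in columns}
print(result)  # {'A': [1, 2, 3], 'B': ['a', 'b'], 'C': [10, 20]}
```

In PySpark itself, the aggregated row can be converted to such a dict with `df.select([F.collect_set(x).alias(x) for x in df.columns]).first().asDict()`.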
Upvotes: 4