Reputation: 105
I have the following dataframe:
+---+-----+-------+---+
| id|state|country|zip|
+---+-----+-------+---+
|  1|  AAA|    USA|123|
|  2|  XXX|    CHN|234|
|  3|  AAA|    USA|123|
|  4|  PPP|    USA|222|
|  5|  PPP|    USA|222|
|  5|  XXX|    CHN|234|
+---+-----+-------+---+
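For reproducibility, the sample data can be created like this (a minimal sketch; the real table has many more columns and the column types are an assumption):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data matching the table above (types are assumed)
df = spark.createDataFrame(
    [(1, "AAA", "USA", 123),
     (2, "XXX", "CHN", 234),
     (3, "AAA", "USA", 123),
     (4, "PPP", "USA", 222),
     (5, "PPP", "USA", 222),
     (5, "XXX", "CHN", 234)],
    ["id", "state", "country", "zip"],
)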
I want to create a flat dataframe that has, for each column, an array of (value, count) pairs for every distinct value, something like this:
+------------------------------+--------------------+------------------------------+
|state                         |country             |zip                           |
+------------------------------+--------------------+------------------------------+
|[[AAA, 2], [PPP, 2], [XXX, 2]]|[[USA, 4], [CHN, 2]]|[[123, 2], [234, 2], [222, 2]]|
+------------------------------+--------------------+------------------------------+
The original table has 600+ columns, but my aim is to do this only for the columns that contain fewer than 100 unique values in total.
Upvotes: 2
Views: 700
Reputation: 14845
You can count the values for each column separately and then join the results:
from pyspark.sql import functions as F
import functools

df = ...

# get all column names and remove the id column from this list
cols = df.schema.fieldNames()
cols.remove("id")

# for each column, count the occurrences of every distinct value and
# collect the (value, count) pairs into a single-row dataframe
dfs = []
for col in cols:
    dfs.append(df.groupBy(col).count().agg(F.collect_list(F.array(col, "count")).alias(col)))

# combine the single-row results into one dataset; the crossJoin is
# cheap here because every intermediate dataframe has exactly one row
result = functools.reduce(lambda l, r: l.crossJoin(r), dfs)
result.show(truncate=False)
Output:
+------------------------------+--------------------+------------------------------+
|state |country |zip |
+------------------------------+--------------------+------------------------------+
|[[PPP, 2], [XXX, 2], [AAA, 2]]|[[USA, 4], [CHN, 2]]|[[222, 2], [234, 2], [123, 2]]|
+------------------------------+--------------------+------------------------------+
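To restrict this to the columns with fewer than 100 unique values (as mentioned in the question), you could pre-filter cols before the loop, for example with approx_count_distinct. A minimal sketch (the 100 threshold and the use of an approximate count are assumptions; use countDistinct if you need exact numbers):
# approximate distinct counts for all candidate columns in a single pass
distinct_counts = df.agg(
    *[F.approx_count_distinct(c).alias(c) for c in cols]
).first().asDict()

# keep only the columns with (roughly) fewer than 100 unique values
cols = [c for c in cols if distinct_counts[c] < 100]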
Upvotes: 1