Reputation: 105
I have the following dataframe:
+---+-----+-------+---+
| id|state|country|zip|
+---+-----+-------+---+
|  1|  AAA|    USA|123|
|  2|  XXX|    CHN|234|
|  3|  AAA|    USA|123|
|  4|  PPP|    USA|222|
|  5|  PPP|    USA|222|
|  5|  XXX|    CHN|234|
+---+-----+-------+---+
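For reproducibility, the sample data can be created like this (a minimal sketch; the real table has many more columns and the column types are an assumption):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data matching the table above (types are assumed)
df = spark.createDataFrame(
    [(1, "AAA", "USA", 123),
     (2, "XXX", "CHN", 234),
     (3, "AAA", "USA", 123),
     (4, "PPP", "USA", 222),
     (5, "PPP", "USA", 222),
     (5, "XXX", "CHN", 234)],
    ["id", "state", "country", "zip"],
)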
I want to create a flat dataframe that has, for each column, an array of (value, count) pairs for every distinct value, something like this:
+------------------------------+--------------------+------------------------------+
|state                         |country             |zip                           |
+------------------------------+--------------------+------------------------------+
|[[AAA, 2], [PPP, 2], [XXX, 2]]|[[USA, 4], [CHN, 2]]|[[123, 2], [234, 2], [222, 2]]|
+------------------------------+--------------------+------------------------------+
The original table has 600+ columns, but my aim is to do this only for the columns that contain fewer than 100 unique values in total.
Upvotes: 2
Views: 700
Reputation: 14845
You can count the values for each column separately and then join the results:
from pyspark.sql import functions as F
import functools

df = ...

# get all column names and remove the id column from this list
cols = df.schema.fieldNames()
cols.remove("id")

# for each column, count the occurrences of every distinct value and
# collect the (value, count) pairs into a single-row dataframe
dfs = []
for col in cols:
    dfs.append(df.groupBy(col).count().agg(F.collect_list(F.array(col, "count")).alias(col)))

# combine the single-row results into one dataset; the crossJoin is
# cheap here because every intermediate dataframe has exactly one row
result = functools.reduce(lambda l, r: l.crossJoin(r), dfs)
result.show(truncate=False)
Output:
+------------------------------+--------------------+------------------------------+
|state |country |zip |
+------------------------------+--------------------+------------------------------+
|[[PPP, 2], [XXX, 2], [AAA, 2]]|[[USA, 4], [CHN, 2]]|[[222, 2], [234, 2], [123, 2]]|
+------------------------------+--------------------+------------------------------+
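To restrict this to the columns with fewer than 100 unique values (as mentioned in the question), you could pre-filter cols before the loop, for example with approx_count_distinct. A minimal sketch (the 100 threshold and the use of an approximate count are assumptions; use countDistinct if you need exact numbers):
# approximate distinct counts for all candidate columns in a single pass
distinct_counts = df.agg(
    *[F.approx_count_distinct(c).alias(c) for c in cols]
).first().asDict()

# keep only the columns with (roughly) fewer than 100 unique values
cols = [c for c in cols if distinct_counts[c] < 100]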
Upvotes: 1