elokema

Reputation: 159

PySpark: How to aggregate on a column with counts of the different states

I want to group by the Identifiant column, count the occurrences of each state, and have every state represented as its own column.

Identifiant  state
ID01         NY
ID02         NY
ID01         CA
ID03         CA
ID01         CA
ID03         NY
ID01         NY
ID01         CA
ID01         NY

I'd like to obtain this dataset:

Identifiant  NY  CA
ID01         3   3
ID02         1   0
ID03         1   1
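
For a reproducible example, the input above can be built as a DataFrame like this (a minimal sketch, assuming an active SparkSession bound to the name spark and the column names Identifiant and state shown above):

# Sketch: recreate the sample data from the question
# (assumes `spark` is an active SparkSession).
data = [
    ("ID01", "NY"), ("ID02", "NY"), ("ID01", "CA"),
    ("ID03", "CA"), ("ID01", "CA"), ("ID03", "NY"),
    ("ID01", "NY"), ("ID01", "CA"), ("ID01", "NY"),
]
df = spark.createDataFrame(data, ["Identifiant", "state"])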

Upvotes: 0

Views: 772

Answers (1)

blackbishop

Reputation: 32660

Group by Identifiant and pivot the state column:

result = (df.groupBy("Identifiant")   # one output row per Identifiant
          .pivot("state")             # one output column per distinct state
          .count()                    # count rows for each (Identifiant, state) pair
          .na.fill(0))                # fill missing combinations with 0

result.show()
#+-----------+---+---+
#|Identifiant| CA| NY|
#+-----------+---+---+
#|       ID03|  1|  1|
#|       ID01|  3|  3|
#|       ID02|  0|  1|
#+-----------+---+---+
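
As a side note, if the set of states is known in advance, you can pass the values to pivot explicitly, which lets Spark skip the extra job that scans for distinct pivot values (a sketch using the two states from the sample data; adjust the list to your actual states):

result = (df.groupBy("Identifiant")
          .pivot("state", ["NY", "CA"])  # explicit pivot values avoid the distinct-values scan
          .count()
          .na.fill(0))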

Upvotes: 1
