elokema

Reputation: 159

PySpark: How to aggregate on a column with counts of the different states

I want to group by the Identifiant column, count the occurrences of each state, and have every state represented as its own column.

Identifiant  state
ID01         NY
ID02         NY
ID01         CA
ID03         CA
ID01         CA
ID03         NY
ID01         NY
ID01         CA
ID01         NY

I'd like to obtain this dataset:

Identifiant  NY  CA
ID01         3   3
ID02         1   0
ID03         1   1
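
For a reproducible example, the input above can be built as a DataFrame like this (a minimal sketch, assuming an active SparkSession bound to the name spark and the column names Identifiant and state shown above):

# Sketch: recreate the sample data from the question
# (assumes `spark` is an active SparkSession).
data = [
    ("ID01", "NY"), ("ID02", "NY"), ("ID01", "CA"),
    ("ID03", "CA"), ("ID01", "CA"), ("ID03", "NY"),
    ("ID01", "NY"), ("ID01", "CA"), ("ID01", "NY"),
]
df = spark.createDataFrame(data, ["Identifiant", "state"])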

Upvotes: 0

Views: 772

Answers (1)

blackbishop

Reputation: 32660

Group by Identifiant and pivot the state column:

result = (df.groupBy("Identifiant")   # one output row per Identifiant
          .pivot("state")             # one output column per distinct state
          .count()                    # count rows for each (Identifiant, state) pair
          .na.fill(0))                # fill missing combinations with 0

result.show()
#+-----------+---+---+
#|Identifiant| CA| NY|
#+-----------+---+---+
#|       ID03|  1|  1|
#|       ID01|  3|  3|
#|       ID02|  0|  1|
#+-----------+---+---+
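
As a side note, if the set of states is known in advance, you can pass the values to pivot explicitly, which lets Spark skip the extra job that scans for distinct pivot values (a sketch using the two states from the sample data; adjust the list to your actual states):

result = (df.groupBy("Identifiant")
          .pivot("state", ["NY", "CA"])  # explicit pivot values avoid the distinct-values scan
          .count()
          .na.fill(0))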

Upvotes: 1
