Reputation: 75
I am working with a PySpark DataFrame that looks like this:
id | category |
---|---|
1 | A |
1 | A |
1 | B |
2 | B |
2 | A |
3 | B |
3 | B |
3 | B |
I want to unstack the category column and count the occurrences of each value, so the result I want is shown below:
id | A | B |
---|---|---|
1 | 2 | 1 |
2 | 1 | 1 |
3 | Null | 3 |
I tried finding something on the internet that could help me, but I couldn't find anything that gives this specific result.
Upvotes: 7
Views: 5738
Reputation: 391
Short version: you don't need multiple groupBy calls.
df.groupBy("id").pivot("category").count().show()
Upvotes: 7
Reputation: 767
Try this (not sure it's optimized):
df = spark.createDataFrame([(1,'A'),(1,'A'),(1,'B'),(2,'B'),(2,'A'),(3,'B'),(3,'B'),(3,'B')], ['id','category'])
# count rows per (id, category) pair, then pivot the category values into columns
df = df.groupBy('id','category').count()
df.groupBy('id').pivot('category').sum('count').show()
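As a side note, if the distinct category values are known up front, passing them to pivot() explicitly spares Spark an extra pass over the data to discover them (the column list ['A', 'B'] here is an assumption taken from the sample data):

df.groupBy('id').pivot('category', ['A', 'B']).sum('count').show()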
Upvotes: 4