bob

Reputation: 79

How do I group by multiple columns and count in PySpark?

How do I do this analysis in PySpark?
I'm not sure how to do this with groupBy:

Input

ID   Rating
AAA  1
AAA  2
BBB  3
BBB  2
AAA  2
BBB  2

Output

ID   Rating  Frequency
AAA  1       1
AAA  2       2
BBB  2       2
BBB  3       1

Upvotes: 1

Views: 5866

Answers (1)

mck

Reputation: 42332

You can group by both the ID and Rating columns, count the rows in each group, and sort the result:

import pyspark.sql.functions as F

# Count rows per (ID, Rating) pair, then sort for a readable output
df2 = (df.groupBy('ID', 'Rating')
         .agg(F.count('*').alias('Frequency'))
         .orderBy('ID', 'Rating'))
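If you want to check this against the question's expected output, here is a minimal end-to-end sketch; the SparkSession setup and the inline data list are assumptions added for illustration, not part of the original answer:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Rebuild the question's sample input (column names taken from the question)
df = spark.createDataFrame(
    [('AAA', 1), ('AAA', 2), ('BBB', 3), ('BBB', 2), ('AAA', 2), ('BBB', 2)],
    ['ID', 'Rating'],
)

df2 = (df.groupBy('ID', 'Rating')
         .agg(F.count('*').alias('Frequency'))
         .orderBy('ID', 'Rating'))
df2.show()
# +---+------+---------+
# | ID|Rating|Frequency|
# +---+------+---------+
# |AAA|     1|        1|
# |AAA|     2|        2|
# |BBB|     2|        2|
# |BBB|     3|        1|
# +---+------+---------+

Equivalently, df.groupBy('ID', 'Rating').count() produces the same counts in a column named count, which you can rename with .withColumnRenamed('count', 'Frequency').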

Upvotes: 3
