Reputation: 79
How do I do this analysis in PySpark? I'm not sure how to do it with groupBy:
Input
ID | Rating |
---|---|
AAA | 1 |
AAA | 2 |
BBB | 3 |
BBB | 2 |
AAA | 2 |
BBB | 2 |
Output
ID | Rating | Frequency |
---|---|---|
AAA | 1 | 1 |
AAA | 2 | 2 |
BBB | 2 | 2 |
BBB | 3 | 1 |
Upvotes: 1
Views: 5866
Reputation: 42332
You can group by both the ID and Rating columns and count the rows in each group:
import pyspark.sql.functions as F

# Count the rows in each (ID, Rating) group, then sort for readability
df2 = (df.groupBy('ID', 'Rating')
         .agg(F.count('*').alias('Frequency'))
         .orderBy('ID', 'Rating'))
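If you want to sanity-check the expected result without starting a Spark session, the same frequency count can be sketched in plain Python with `collections.Counter`, using the sample data from the question:

```python
from collections import Counter

# Sample data matching the question's input table
rows = [("AAA", 1), ("AAA", 2), ("BBB", 3),
        ("BBB", 2), ("AAA", 2), ("BBB", 2)]

# Count occurrences of each (ID, Rating) pair,
# mirroring groupBy('ID', 'Rating') with a count aggregate
freq = Counter(rows)

# Sort by ID, then Rating, like orderBy('ID', 'Rating')
result = sorted((id_, rating, n) for (id_, rating), n in freq.items())
for id_, rating, n in result:
    print(id_, rating, n)
# AAA 1 1
# AAA 2 2
# BBB 2 2
# BBB 3 1
```

This matches the output table above and confirms the grouping logic before running it on a real DataFrame.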
Upvotes: 3