I have a dataset that looks like this:
id | category | value
---+----------+------
 1 | a        | 3
 2 | a        | 3
 3 | a        | 3
 3 | b        | 1
 4 | a        | 1
 4 | b        | abc
The output I want is:
id | category_a | category_b
---+------------+-----------
 1 | 3          | null
 2 | 3          | null
 3 | 3          | 1
 4 | 1          | abc
In other words, I want to group by id, pivot on category, and create one column per category. How can I transform the input into the expected output?
My approach is:

from pyspark.sql import functions as F

pivoted_df = df.groupBy("id") \
    .pivot("category") \
    .agg(F.lit(F.col("value")))
But I got this error:
pyspark.sql.utils.AnalysisException: Aggregate expression required for pivot, but '`value`' did not appear in any aggregate function.;
Update: the value column also contains non-numeric values. For the category column, each id has at most two rows, one per category (a and b).