huy

Reputation: 1914

PySpark add multiple columns based on categories from the other column

I have a dataset that looks like this:

id | category | value
---+----------+------
1  | a        | 3
2  | a        | 3
3  | a        | 3
3  | b        | 1
4  | a        | 1
4  | b        | abc

The output I want is:

id | category_a | category_b
---+------------+-----------
1  | 3          | null
2  | 3          | null
3  | 3          | 1
4  | 1          | abc

That is, it should group by id, pivot on category, and create one dummy column per category.

How can I transform the input to the expected output?

My approach is:

pivoted_df = df.groupBy("id") \
        .pivot("category") \
        .agg(F.lit(F.col("value")))

But I got this error:

pyspark.sql.utils.AnalysisException: Aggregate expression required for pivot, but '`value`' did not appear in any aggregate function.;

Update: The value column also contains non-numeric values.

For the category column, each id has at most one row per category (a and b), so at most two rows in total.

Upvotes: 0

Views: 594

Answers (1)

过过招

Reputation: 4244
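
Pivot requires the values to be wrapped in an aggregate function, which is exactly what the error message says. Since each (id, category) pair has at most one row, F.first('value') satisfies that requirement and simply returns the single value in each group: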

df = df.groupBy('id').pivot('category').agg(F.first('value'))
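
Below is a minimal, self-contained sketch of the same approach, assuming a local SparkSession and the sample data from the question (the setup and variable names here are illustrative, not part of the original answer). The value column is typed as string throughout, since it mixes numbers and text:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; value is a string column because it
# mixes numeric and non-numeric entries ('3', 'abc', ...).
df = spark.createDataFrame(
    [(1, "a", "3"), (2, "a", "3"), (3, "a", "3"),
     (3, "b", "1"), (4, "a", "1"), (4, "b", "abc")],
    ["id", "category", "value"],
)

# pivot() turns each distinct category into its own column; F.first()
# provides the required aggregate and, with at most one row per
# (id, category) pair, just returns that single value.
pivoted_df = df.groupBy("id").pivot("category").agg(F.first("value"))
pivoted_df.orderBy("id").show()
# Expected output (null rendering may vary by Spark version):
# +---+---+----+
# | id|  a|   b|
# +---+---+----+
# |  1|  3|null|
# |  2|  3|null|
# |  3|  3|   1|
# |  4|  1| abc|
# +---+---+----+

Note that the pivoted columns are named after the category values (a, b); to get the column names asked for in the question, rename them afterwards, e.g. pivoted_df.withColumnRenamed('a', 'category_a').withColumnRenamed('b', 'category_b').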

Upvotes: 1
