Reputation: 230
I am looking to essentially pivot without requiring an aggregation at the end, to keep the dataframe intact and not create a grouped object.
As an example, I have this:
+--------+-----------+-------+----+
|country |code       |Value  |ids |
+--------+-----------+-------+----+
|Mexico  |food_1_3   |apple  |1   |
|Mexico  |food_1_3   |banana |2   |
|Canada  |beverage_2 |milk   |1   |
|Mexico  |beverage_2 |water  |2   |
+--------+-----------+-------+----+
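For reference, a minimal sketch of building this sample frame, assuming a local SparkSession (the setup is not part of my actual job):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Assumed setup: a plain local SparkSession.
spark = SparkSession.builder.getOrCreate()

# Recreate the sample data from the table above.
df = spark.createDataFrame(
    [
        ("Mexico", "food_1_3", "apple", 1),
        ("Mexico", "food_1_3", "banana", 2),
        ("Canada", "beverage_2", "milk", 1),
        ("Mexico", "beverage_2", "water", 2),
    ],
    ["country", "code", "Value", "ids"],
)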
I need this:
+--------+---+---------+-----------+
|country |ids|food_1_3 |beverage_2 |
+--------+---+---------+-----------+
|Mexico  |1  |apple    |           |
|Mexico  |2  |banana   |water      |
|Canada  |1  |         |milk       |
+--------+---+---------+-----------+
I have tried:
import pyspark.sql.functions as F

(
    df
    .groupby("country", "ids")
    .pivot("code")
    .agg(
        F.first("Value").alias("Value"),
    )
    .show()
)
but that just gives me essentially a top 1 per group. In my real case I have 20 columns, some with just integers and others with strings... so none of the aggregations I have tried (sums, counts, collect_list) have worked out...
Upvotes: 3
Views: 2909
Reputation: 42422
That's because your ids value is not unique. Add a unique index column and that should work:
import pyspark.sql.functions as F

pivoted = (
    df
    .groupby(
        "country",
        "ids",
        F.monotonically_increasing_id().alias("index"),
    )
    .pivot("code")
    .agg(
        F.first("Value").alias("Value"),
    )
    .drop("index")
)
pivoted.show()
+-------+---+----------+--------+
|country|ids|beverage_2|food_1_3|
+-------+---+----------+--------+
| Mexico| 1| null| apple|
| Mexico| 2| water| null|
| Canada| 1| milk| null|
| Mexico| 2| null| banana|
+-------+---+----------+--------+
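Note this still leaves Mexico/ids=2 split over two rows, since every input row got its own index. If you then want exactly one row per (country, ids) pair, as in your target table, a second group-by that keeps the first non-null value per pivoted column would collapse them. A sketch building on pivoted above, assuming each (country, ids, code) combination holds at most one Value:

collapsed = (
    pivoted
    .groupby("country", "ids")
    .agg(*[
        # ignorenulls=True skips the empty cells introduced by the pivot
        F.first(c, ignorenulls=True).alias(c)
        for c in pivoted.columns
        if c not in ("country", "ids")
    ])
)
collapsed.show()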
Upvotes: 4