pdangelo4

Reputation: 230

pyspark pivot without aggregation

I am looking to essentially pivot without requiring an aggregation at the end, so the dataframe stays intact and I don't create a grouped object.

As an example, I have this:

+--------+-----------+-------+----+
|country |code       |Value  |ids |
+--------+-----------+-------+----+
|Mexico  |food_1_3   |apple  |1   |
|Mexico  |food_1_3   |banana |2   |
|Canada  |beverage_2 |milk   |1   |
|Mexico  |beverage_2 |water  |2   |
+--------+-----------+-------+----+

Need this:

+--------+---+---------+-----------+
|country |ids|food_1_3 |beverage_2 |
+--------+---+---------+-----------+
|Mexico  |1  |apple    |           |
|Mexico  |2  |banana   |water      |
|Canada  |1  |         |milk       |
+--------+---+---------+-----------+

I have tried:

import pyspark.sql.functions as F

(
    df
    .groupby("country", "ids")
    .pivot("code")
    .agg(
        F.first("Value").alias("Value"),
    )
    .show()
)

but that just gives me essentially a top 1 per group. In my real case I have 20 columns, some with just integers and others with strings... so sums, counts, collect_list, none of those aggs have worked out...
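
For reference, the example dataframe above can be built like this (assuming an active SparkSession named spark):

df = spark.createDataFrame(
    [
        ("Mexico", "food_1_3", "apple", 1),
        ("Mexico", "food_1_3", "banana", 2),
        ("Canada", "beverage_2", "milk", 1),
        ("Mexico", "beverage_2", "water", 2),
    ],
    ["country", "code", "Value", "ids"],
)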

Upvotes: 3

Views: 2909

Answers (1)

mck

Reputation: 42422

That's because your ids column is not unique. Add a unique index column to the grouping and that should work:

import pyspark.sql.functions as F

pivoted = (
    df
    .groupby(
        "country",
        "ids",
        # one group per original row, so the pivot never merges rows
        F.monotonically_increasing_id().alias("index")
    )
    .pivot("code")
    .agg(
        F.first("Value").alias("Value"),
    )
    .drop("index")
)

pivoted.show()
+-------+---+----------+--------+
|country|ids|beverage_2|food_1_3|
+-------+---+----------+--------+
| Mexico|  1|      null|   apple|
| Mexico|  2|     water|    null|
| Canada|  1|      milk|    null|
| Mexico|  2|      null|  banana|
+-------+---+----------+--------+
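
Note that the per-row index keeps the two Mexico/2 rows separate, as in the output above. If you instead want them merged into the single row shown in your desired output, one option is a second pass that groups on country and ids and takes the first non-null value in each pivoted column. A sketch, assuming the pivoted column names beverage_2 and food_1_3 from above:

merged = pivoted.groupby("country", "ids").agg(
    # first non-null value per pivoted column; safe here because each
    # (country, ids, code) combination holds at most one Value
    *[F.first(c, ignorenulls=True).alias(c) for c in ["beverage_2", "food_1_3"]]
)
merged.show()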

Upvotes: 4
