Reputation: 230
I am looking to essentially pivot without requiring an aggregation at the end, to keep the dataframe intact and not create a grouped object.
As an example, I have this:
+--------+-----------+-------+----+
|country |code       |Value  |ids |
+--------+-----------+-------+----+
|Mexico  |food_1_3   |apple  |1   |
|Mexico  |food_1_3   |banana |2   |
|Canada  |beverage_2 |milk   |1   |
|Mexico  |beverage_2 |water  |2   |
+--------+-----------+-------+----+
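For reference, a minimal sketch of building this sample frame, assuming a local SparkSession (the setup is not part of my actual job):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Assumed setup: a plain local SparkSession.
spark = SparkSession.builder.getOrCreate()

# Recreate the sample data from the table above.
df = spark.createDataFrame(
    [
        ("Mexico", "food_1_3", "apple", 1),
        ("Mexico", "food_1_3", "banana", 2),
        ("Canada", "beverage_2", "milk", 1),
        ("Mexico", "beverage_2", "water", 2),
    ],
    ["country", "code", "Value", "ids"],
)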
I need this:
+--------+---+---------+-----------+
|country |ids|food_1_3 |beverage_2 |
+--------+---+---------+-----------+
|Mexico  |1  |apple    |           |
|Mexico  |2  |banana   |water      |
|Canada  |1  |         |milk       |
+--------+---+---------+-----------+
I have tried:
import pyspark.sql.functions as F

(
    df
    .groupby("country", "ids")
    .pivot("code")
    .agg(
        F.first("Value").alias("Value"),
    )
    .show()
)
but that just gives me essentially a top 1 per group. In my real case I have 20 columns, some with just integers and others with strings... so none of the aggregations I have tried (sums, counts, collect_list) have worked out...
Upvotes: 3
Views: 2909
Reputation: 42422
That's because your ids value is not unique. Add a unique index column and that should work:
import pyspark.sql.functions as F

pivoted = (
    df
    .groupby(
        "country",
        "ids",
        F.monotonically_increasing_id().alias("index"),
    )
    .pivot("code")
    .agg(
        F.first("Value").alias("Value"),
    )
    .drop("index")
)
pivoted.show()
+-------+---+----------+--------+
|country|ids|beverage_2|food_1_3|
+-------+---+----------+--------+
| Mexico| 1| null| apple|
| Mexico| 2| water| null|
| Canada| 1| milk| null|
| Mexico| 2| null| banana|
+-------+---+----------+--------+
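Note this still leaves Mexico/ids=2 split over two rows, since every input row got its own index. If you then want exactly one row per (country, ids) pair, as in your target table, a second group-by that keeps the first non-null value per pivoted column would collapse them. A sketch building on pivoted above, assuming each (country, ids, code) combination holds at most one Value:

collapsed = (
    pivoted
    .groupby("country", "ids")
    .agg(*[
        # ignorenulls=True skips the empty cells introduced by the pivot
        F.first(c, ignorenulls=True).alias(c)
        for c in pivoted.columns
        if c not in ("country", "ids")
    ])
)
collapsed.show()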
Upvotes: 4