Reputation: 193
I have around ~30M of records, containing sales data, looking like this:
item | type | days_diff | placed_orders | cancelled_orders |
---|---|---|---|---|
console | ps5 | -10 | 8 | 1 |
console | xbox | -8 | 6 | 0 |
console | ps5 | -5 | 4 | 4 |
console | xbox | -1 | 10 | 7 |
console | xbox | 0 | 2 | 3 |
games | ps5 | -11 | 48 | 9 |
games | ps5 | -3 | 2 | 4 |
games | xbox | 5 | 10 | 2 |
I would like to decrease the number of rows, by creating list of lists corresponding to particular item, like this:
item | types | days_diff | placed_orders | cancelled_orders |
---|---|---|---|---|
console | ['ps5', 'xbox'] | [[-10, -5],[-8, -1, 0]] | [[8, 4],[6, 10, 2]] | [[1, 4],[0, 7, 3]] |
games | ['ps5' ,'xbox'] | [[-11, -3],[5]] | [[48, 2],[10]] | [[9, 4],[2]] |
How can achieve it using PySpark?
Upvotes: 0
Views: 49
Reputation: 4888
You can achieve this by performing 2 groupBy the first on the couple ("item", "type") and then on the column ("item"):
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame(
data=[["console", "ps5", -10, 8, 1], ["console", "xbox", -8, 6, 0],
["console", "ps5", -5, 4, 4], ["console", "xbox", -1, 10, 7], ["console", "xbox", 0, 2, 3],
["games", "ps5", -11, 48, 9], ["games", "ps5", -3, 2, 4], ["games", "xbox", 5, 10, 2]
], schema=["item", "type", "days_diff", "placed_orders", "cancelled_orders"])
df = df.groupBy("item", "type").agg(
collect_list("days_diff").alias("days_diff"),
collect_list("placed_orders").alias("placed_orders"),
collect_list("cancelled_orders").alias("cancelled_orders")
)
df = df.groupBy("item").agg(
collect_list("type").alias("types"),
collect_list("days_diff").alias("days_diff"),
collect_list("placed_orders").alias("placed_orders"),
collect_list("cancelled_orders").alias("cancelled_orders")
)
df.show(10, False)
+-------+-----------+------------------------+--------------------+-------------------+
|item |types |days_diff |placed_orders |cancelled_orders |
+-------+-----------+------------------------+--------------------+-------------------+
|console|[ps5, xbox]|[[-10, -5], [-8, -1, 0]]|[[8, 4], [6, 10, 2]]|[[1, 4], [0, 7, 3]]|
|games |[ps5, xbox]|[[-11, -3], [5]] |[[48, 2], [10]] |[[9, 4], [2]] |
+-------+-----------+------------------------+--------------------+-------------------+
Upvotes: 1