ohh
ohh

Reputation: 193

Pivot and transpose dataset using PySpark

I have around ~30M of records, containing sales data, looking like this:

item type days_diff placed_orders cancelled_orders
console ps5 -10 8 1
console xbox -8 6 0
console ps5 -5 4 4
console xbox -1 10 7
console xbox 0 2 3
games ps5 -11 48 9
games ps5 -3 2 4
games xbox 5 10 2

I would like to decrease the number of rows, by creating list of lists corresponding to particular item, like this:

item types days_diff placed_orders cancelled_orders
console ['ps5', 'xbox'] [[-10, -5],[-8, -1, 0]] [[8, 4],[6, 10, 2]] [[1, 4],[0, 7, 3]]
games ['ps5' ,'xbox'] [[-11, -3],[5]] [[48, 2],[10]] [[9, 4],[2]]

How can achieve it using PySpark?

Upvotes: 0

Views: 49

Answers (1)

Abdennacer Lachiheb
Abdennacer Lachiheb

Reputation: 4888

You can achieve this by performing 2 groupBy the first on the couple ("item", "type") and then on the column ("item"):

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame(
    data=[["console", "ps5", -10, 8, 1], ["console", "xbox", -8, 6, 0],
          ["console", "ps5", -5, 4, 4], ["console", "xbox", -1, 10, 7], ["console", "xbox", 0, 2, 3],
          ["games", "ps5", -11, 48, 9], ["games", "ps5", -3, 2, 4], ["games", "xbox", 5, 10, 2]
          ], schema=["item", "type", "days_diff", "placed_orders", "cancelled_orders"])
df = df.groupBy("item", "type").agg(
    collect_list("days_diff").alias("days_diff"),
    collect_list("placed_orders").alias("placed_orders"),
    collect_list("cancelled_orders").alias("cancelled_orders")
)
df = df.groupBy("item").agg(
    collect_list("type").alias("types"),
    collect_list("days_diff").alias("days_diff"),
    collect_list("placed_orders").alias("placed_orders"),
    collect_list("cancelled_orders").alias("cancelled_orders")
)
df.show(10, False)

+-------+-----------+------------------------+--------------------+-------------------+
|item   |types      |days_diff               |placed_orders       |cancelled_orders   |
+-------+-----------+------------------------+--------------------+-------------------+
|console|[ps5, xbox]|[[-10, -5], [-8, -1, 0]]|[[8, 4], [6, 10, 2]]|[[1, 4], [0, 7, 3]]|
|games  |[ps5, xbox]|[[-11, -3], [5]]        |[[48, 2], [10]]     |[[9, 4], [2]]      |
+-------+-----------+------------------------+--------------------+-------------------+

Upvotes: 1

Related Questions