Matin Shokri

Reputation: 62

How to aggregate several dates and their associated data into one row in PySpark?

I want to aggregate a set of dates (for example, one month per customer) and their data into a single row in PySpark.

A simple example is the table below:

Customer_Id  Date        Data
id1          2021-01-01  2
id1          2021-01-02  3
id1          2021-01-03  4

I want to change it into:

Customer_Id  Date                       col1  col2  col3
id1          [2021-01-01 - 2021-01-03]  2     3     4

Upvotes: 0

Views: 169

Answers (1)

Mahesh Gupta

Reputation: 1892

@matin you can try the code below to replicate the output:

from pyspark.sql.functions import collect_list, sort_array, struct, col

schema = ["Customer_Id", "Date", "Data"]
data = [["id1", "2021-01-01", 2], ["id1", "2021-01-02", 3], ["id1", "2021-01-03", 4]]
df = spark.createDataFrame(data, schema)

# Collect (Date, Data) pairs into one array and sort it by Date, so the
# resulting column order is deterministic. Note that collect_list on its
# own does not guarantee any ordering after a groupBy.
df2 = df.groupBy("Customer_Id").agg(
    sort_array(collect_list(struct("Date", "Data"))).alias("pairs")
)
df3 = (df2.withColumn("Date", col("pairs.Date"))
          .withColumn("col1", col("pairs.Data")[0])
          .withColumn("col2", col("pairs.Data")[1])
          .withColumn("col3", col("pairs.Data")[2])
          .drop("pairs"))
df3.show(truncate=False)
df3.printSchema()

Let me know if you need further modification.

Upvotes: 1
