Reputation: 62
I want to aggregate several dates (for example, one month per customer) and their data into a single row in PySpark. Here is a simple example, as in the table below:
| Customer_Id | Date | Data |
| --- | --- | --- |
| id1 | 2021-01-01 | 2 |
| id1 | 2021-01-02 | 3 |
| id1 | 2021-01-03 | 4 |
I want to transform it into:
| Customer_Id | Date | col1 | col2 | col3 |
| --- | --- | --- | --- | --- |
| id1 | [2021-01-01 - 2021-01-03] | 2 | 3 | 4 |
Upvotes: 0
Views: 169
Reputation: 1892
@matin, you can try the code below to replicate the output:
from pyspark.sql.functions import collect_list

schema = ["Customer_Id", "Date", "Data"]
data = [["id1", "2021-01-01", 2], ["id1", "2021-01-02", 3], ["id1", "2021-01-03", 4]]
df = spark.createDataFrame(data, schema)

# Collect the dates and values into arrays, one row per customer.
# Note: collect_list does not guarantee element order, so sort first if order matters.
df2 = df.groupBy("Customer_Id").agg(
    collect_list("Date").alias("list_date"),
    collect_list("Data").alias("list_data"),
)

# Turn the collected data array into one column per element.
df3 = (df2.withColumn("col1", df2.list_data[0])
          .withColumn("col2", df2.list_data[1])
          .withColumn("col3", df2.list_data[2])
          .drop("list_data"))
df3.show(truncate=False)
df3.printSchema()
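If you also want the Date column rendered as a range string like in your expected output, and the colN columns generated dynamically rather than hard-coded to three, something along these lines should work (a sketch reusing df from above; it assumes the dates are ISO-formatted strings, so min/max give the range endpoints):

from pyspark.sql.functions import (
    col, collect_list, concat, concat_ws, lit, size,
    min as min_, max as max_,
)

# Build the "[start - end]" Date string while collecting the values.
agg_df = df.groupBy("Customer_Id").agg(
    concat(lit("["), concat_ws(" - ", min_("Date"), max_("Date")), lit("]")).alias("Date"),
    collect_list("Data").alias("list_data"),
)

# The widest list across all customers decides how many colN columns to create.
n = agg_df.select(max_(size(col("list_data")))).first()[0]

result = agg_df.select(
    "Customer_Id",
    "Date",
    *[col("list_data")[i].alias(f"col{i+1}") for i in range(n)],
)
result.show(truncate=False)

Customers with fewer rows than the widest list simply get null in the trailing columns, since indexing past the end of a Spark array returns null.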
Let me know if you need any further modifications.
Upvotes: 1