Reputation: 499
I am solving a regression problem. I first clustered the data and applied a regression model on each cluster. Now I want to implement another regression model that takes the predicted output of each cluster as a feature and outputs the aggregated predicted value.
I have already implemented the clustering and the per-cluster regression models in pyspark, but I am unable to extract the output of each cluster as a feature to feed into the second regression model.
How can this conversion be achieved efficiently in pyspark (preferably) or pandas?
Current dataframe :
date cluster predVal actual
31-03-2019 0 14 13
31-03-2019 1 24 15
31-03-2019 2 13 10
30-03-2019 0 14 13
30-03-2019 1 24 15
30-03-2019 2 13 10
Required dataframe
date predVal0 predVal1 predVal2 actual
31-03-2019 14 24 13 38 // 13+15+10
30-03-2019 14 24 13 38 // 13+15+10
Upvotes: 0
Views: 348
Reputation: 459
You want to do a pivot in pyspark and then create a new column by summing the predVal{i} columns. You should proceed in three steps.
First, apply a pivot. Your index is date, the column to pivot is cluster, and the value column is predVal:
from pyspark.sql.functions import first

df_pivot = df.groupBy('date').pivot('cluster').agg(first('predVal'))
Then, sum the actual column per date:
df_actual = df.groupBy('date').sum('actual')
Finally, join the summed actual column with the pivoted data on the date column:
df_final = df_pivot.join(df_actual, ['date'])
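Since the question also mentions pandas as an option, here is a minimal sketch of the same three steps (pivot, aggregate, join) in pandas, using the sample data from the question; the column names are the ones shown there:

```python
import pandas as pd

# Sample data matching the question's current dataframe
df = pd.DataFrame({
    'date': ['31-03-2019'] * 3 + ['30-03-2019'] * 3,
    'cluster': [0, 1, 2, 0, 1, 2],
    'predVal': [14, 24, 13, 14, 24, 13],
    'actual': [13, 15, 10, 13, 15, 10],
})

# Step 1: pivot so each cluster's prediction becomes its own column
df_pivot = df.pivot(index='date', columns='cluster', values='predVal')
df_pivot.columns = [f'predVal{c}' for c in df_pivot.columns]

# Step 2: sum the actual values per date
df_actual = df.groupby('date')['actual'].sum()

# Step 3: join on the date index and restore date as a column
df_final = df_pivot.join(df_actual).reset_index()
print(df_final)
```

This yields one row per date with predVal0, predVal1, predVal2 and the summed actual, which you can feed directly into the second regression model.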
This link answers your question well: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
Upvotes: 2