Reputation: 499
I am solving a regression problem. I first clustered the data and applied a regression model on each cluster. Now I want to implement another regression model that takes the predicted output of each cluster as a feature and outputs the aggregated predicted value.
I have already implemented the clustering and the per-cluster regression models in pyspark, but I am unable to extract the output of each cluster as a feature to feed into the second regression model.
How can this conversion be achieved efficiently in pyspark (preferably) or pandas?
Current dataframe :
date cluster predVal actual
31-03-2019 0 14 13
31-03-2019 1 24 15
31-03-2019 2 13 10
30-03-2019 0 14 13
30-03-2019 1 24 15
30-03-2019 2 13 10
Required dataframe
date predVal0 predVal1 predVal2 actual
31-03-2019 14 24 13 38 // 13+15+10
30-03-2019 14 24 13 38 // 13+15+10
Upvotes: 0
Views: 348
Reputation: 459
You want to do a pivot in pyspark and then create a new column by summing the predVal{i} columns. You should proceed in three steps.
First, apply a pivot. Your index is date, the column to pivot is cluster, and the value column is predVal:
from pyspark.sql.functions import first

df_pivot = df.groupBy('date').pivot('cluster').agg(first('predVal'))
Then, sum the actual column per date:
df_actual = df.groupBy('date').sum('actual')
Finally, join the summed actual column with the pivoted data on the date column:
df_final = df_pivot.join(df_actual, ['date'])
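Since the question also mentions pandas as an option, here is a minimal sketch of the same three steps (pivot, aggregate, join) in pandas, using the sample data from the question; the column names are the ones shown there:

```python
import pandas as pd

# Sample data matching the question's current dataframe
df = pd.DataFrame({
    'date': ['31-03-2019'] * 3 + ['30-03-2019'] * 3,
    'cluster': [0, 1, 2, 0, 1, 2],
    'predVal': [14, 24, 13, 14, 24, 13],
    'actual': [13, 15, 10, 13, 15, 10],
})

# Step 1: pivot so each cluster's prediction becomes its own column
df_pivot = df.pivot(index='date', columns='cluster', values='predVal')
df_pivot.columns = [f'predVal{c}' for c in df_pivot.columns]

# Step 2: sum the actual values per date
df_actual = df.groupby('date')['actual'].sum()

# Step 3: join on the date index and restore date as a column
df_final = df_pivot.join(df_actual).reset_index()
print(df_final)
```

This yields one row per date with predVal0, predVal1, predVal2 and the summed actual, which you can feed directly into the second regression model.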
This link answers your question well: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
Upvotes: 2