emma19

Reputation: 57

How to append PySpark DataFrames inside a for loop?

Example: I have a PySpark DataFrame:

df=
    x_data  y_data    
    2.5      1.5       
    3.5      8.5
    4.5      89.5
    5.5      20.5

Let's say I have some calculation to be done on each column of df, which I do inside a for loop. After that, my final output should look like this:

df_output= 
       cal_1 cal_2 cal_3 cal_4   Datatype
        23    24   34     36       x_data
        12    13   18     90       x_data
        23    54   74     96       x_data
        41    13   38     50       x_data
        53    74   44      6       y_data
        72    23   28     50       y_data
        43    24   44     66       y_data
        41    23   58     30       y_data

How do I append these results calculated on each column into the same pyspark output data frame inside the for loop?

Upvotes: 0

Views: 6475

Answers (1)

blackbishop

Reputation: 32640

You can use functools.reduce to union the list of DataFrames created in each iteration.

Something like this:

import functools
from pyspark.sql import DataFrame

output_dfs = []

for c in df.columns:
    # do some calculation on column c
    df_output = _  # calculation result: a DataFrame with the same schema each iteration

    output_dfs.append(df_output)

# union requires all DataFrames in the list to share the same schema
df_output = functools.reduce(DataFrame.union, output_dfs)
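For intuition, functools.reduce simply folds the list pairwise: reduce(f, [a, b, c]) computes f(f(a, b), c). A minimal pure-Python sketch of the same pattern, using list concatenation in place of DataFrame.union so it runs without a Spark session:

```python
import functools

# Each element stands in for one per-column result DataFrame.
output_parts = [[1, 2], [3, 4], [5, 6]]

# reduce folds the list left to right: ([1, 2] + [3, 4]) + [5, 6]
combined = functools.reduce(lambda a, b: a + b, output_parts)
# combined == [1, 2, 3, 4, 5, 6]
```

Note that in Spark, DataFrame.union matches columns by position, not by name; if your per-column results might produce columns in a different order, DataFrame.unionByName is the safer choice.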

Upvotes: 2
