Mikesama

Reputation: 400

Appending to a Spark DataFrame iteratively using PySpark in Databricks

I have a list of header keys that I need to iterate through, calling an API for each one to get data. I create a temporary DataFrame to hold each API response and use union to append the temp DataFrame's rows to a final DataFrame. The code works, but it is very slow. Please help me find a more efficient solution.

# Created df_final empty dataframe before the for loop

list1 = [<contains list of lists of header data>]

for i in range(len(list1)):
    api_header_data = list1[i]['header']

    # Call the API
    input_data = get_api_function(api_header_data)
    response = postrequest(input_data)
    result = response.json()["result"]   # parse the response once
    columns = result["Headers"]
    data = result["Data"]

    # Create a temp DataFrame and union it onto the main DataFrame
    df_temp = spark.createDataFrame(data, columns)
    df_final = df_final.union(df_temp)

Upvotes: 2

Views: 46

Answers (1)

Vikas Sharma

Reputation: 2147

Each union inside the loop grows the DataFrame's logical plan, and you also pay for a createDataFrame call on every iteration, which is what makes this slow. Instead, collect all the rows into a plain Python list first and create the DataFrame once at the end:


list1 = [<contains list of lists of header data>]

all_data = []

for headers in list1:
    api_header_data = headers['header']

    # Call the API
    input_data = get_api_function(api_header_data)
    response = postrequest(input_data)
    result = response.json()["result"]   # parse the response once
    columns = result["Headers"]          # assumes every response has the same columns
    data = result["Data"]
    all_data.extend(data)

if all_data:
    df_final = spark.createDataFrame(all_data, columns)
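If the API responses do not all share the same column set, one flat list will not line up. In that case, a variant (a sketch, assuming Spark 3.1+ for allowMissingColumns, and reusing the same get_api_function and postrequest helpers as above) is to build one small DataFrame per response and combine them in a single pass at the end with unionByName:

from functools import reduce

# One DataFrame per API response, combined once at the end
frames = []
for headers in list1:
    input_data = get_api_function(headers['header'])
    result = postrequest(input_data).json()["result"]
    frames.append(spark.createDataFrame(result["Data"], result["Headers"]))

if frames:
    # unionByName matches columns by name; allowMissingColumns=True fills
    # columns absent from a given response with nulls (Spark 3.1+)
    df_final = reduce(
        lambda a, b: a.unionByName(b, allowMissingColumns=True), frames
    )

This still chains unions in the plan, so prefer the single createDataFrame above whenever the columns do match.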

Upvotes: 1
