Mikesama

Reputation: 400

Appending to a Spark DataFrame iteratively using PySpark in Databricks

I have a list of header keys that I need to iterate through, calling an API for each one to get data. I create a temporary DataFrame to hold each API response and use union to append the temp DataFrame's rows to a final DataFrame. The code works, but it is very slow. Please help me find a more efficient solution.

# Created df_final empty dataframe before the for loop

list1 = [<contains list of lists of header data>]

for i in range(len(list1)):
    api_header_data = list1[i]['header']

    # Call the API
    input_data = get_api_function(api_header_data)
    response = postrequest(input_data)
    result = response.json()["result"]   # parse the response once
    columns = result["Headers"]
    data = result["Data"]

    # Create a temp DataFrame and union it onto the main DataFrame
    df_temp = spark.createDataFrame(data, columns)
    df_final = df_final.union(df_temp)

Upvotes: 2

Views: 46

Answers (1)

Vikas Sharma

Reputation: 2147

Each union inside the loop grows the DataFrame's logical plan, and you also pay for a createDataFrame call on every iteration, which is what makes this slow. Instead, collect all the rows into a plain Python list first and create the DataFrame once at the end:


list1 = [<contains list of lists of header data>]

all_data = []

for headers in list1:
    api_header_data = headers['header']

    # Call the API
    input_data = get_api_function(api_header_data)
    response = postrequest(input_data)
    result = response.json()["result"]   # parse the response once
    columns = result["Headers"]          # assumes every response has the same columns
    data = result["Data"]
    all_data.extend(data)

if all_data:
    df_final = spark.createDataFrame(all_data, columns)
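If the API responses do not all share the same column set, one flat list will not line up. In that case, a variant (a sketch, assuming Spark 3.1+ for allowMissingColumns, and reusing the same get_api_function and postrequest helpers as above) is to build one small DataFrame per response and combine them in a single pass at the end with unionByName:

from functools import reduce

# One DataFrame per API response, combined once at the end
frames = []
for headers in list1:
    input_data = get_api_function(headers['header'])
    result = postrequest(input_data).json()["result"]
    frames.append(spark.createDataFrame(result["Data"], result["Headers"]))

if frames:
    # unionByName matches columns by name; allowMissingColumns=True fills
    # columns absent from a given response with nulls (Spark 3.1+)
    df_final = reduce(
        lambda a, b: a.unionByName(b, allowMissingColumns=True), frames
    )

This still chains unions in the plan, so prefer the single createDataFrame above whenever the columns do match.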

Upvotes: 1
