Erdal Dogan

Reputation: 617

What is the fastest and most efficient way to append rows to a DataFrame?

I have a large dataset that I have to convert to .csv format; it consists of 29 columns and 1M+ rows. As the DataFrame gets larger, appending rows to it becomes more and more time-consuming. Is there a faster way to do this? I'm sharing the relevant snippet from my code below.

Any recommendations are welcome.


import time
import requests
from pandas import DataFrame, concat, json_normalize

df = DataFrame()

for startID in range(0, 100000, 1000):
    s1 = time.time()
    tempdf = DataFrame()
    url = f'https://******/products?startId={startID}&size=1000'

    r = requests.get(url, headers={'****-Token': 'xxxxxx', 'Merchant-Id': '****'})
    jsonList = r.json()  # a list of dicts, one per product

    normalized = json_normalize(jsonList)  # type(normalized) = pandas.DataFrame
    print(startID / 1000)  # status indicator

    for _, series in normalized.iterrows():  # iterrows() yields (index, Series) tuples
        offers = series['offers']
        series = series.drop('offers')  # Series.drop takes labels, not columns=

        for offer in offers:
            n = json_normalize(offer).squeeze()  # squeeze() casts the one-row DataFrame into a Series
            concatenated = concat([series, n]).to_frame().transpose()
            tempdf = tempdf.append(concatenated, ignore_index=True)

    del normalized
    df = df.append(tempdf)
    f1 = time.time()
    print(f1 - s1, ' seconds')

df.to_csv('out.csv')
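
For comparison, here is a sketch of how the same loop could be restructured so that nothing is appended row by row: json_normalize can explode the nested offers list directly via its record_path and meta arguments, and a single concat at the end replaces all of the incremental appends. The meta field names ('productId', 'name') are placeholders, since the real top-level keys are not shown in the question.

import requests
from pandas import concat, json_normalize

frames = []
for startID in range(0, 100000, 1000):
    url = f'https://******/products?startId={startID}&size=1000'
    r = requests.get(url, headers={'****-Token': 'xxxxxx', 'Merchant-Id': '****'})

    # record_path explodes the nested 'offers' list into one row per offer;
    # meta copies the listed top-level fields onto every offer row.
    # 'productId' and 'name' are placeholders -- substitute the real keys.
    frames.append(json_normalize(r.json(),
                                 record_path='offers',
                                 meta=['productId', 'name'],
                                 record_prefix='offer.'))

df = concat(frames, ignore_index=True)  # one concat instead of repeated appends
df.to_csv('out.csv', index=False)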

Upvotes: 41

Views: 41816

Answers (2)

Thomas

Reputation: 613

There is an even more efficient way than the accepted answer.

The idea is to store the data by column instead of by row. Because pandas keeps each column as a contiguous array internally, building the DataFrame from a dict of lists reduces both the memory and the time needed to create it.

Here is the benchmark code:

import pandas as pd
import time
import random

repetitions = 1000000

# Row store: build a list of dicts, one dict per row
start_time = time.time()
dictionary_list = []
for _ in range(repetitions):
    dictionary_data = {k: random.random() for k in range(30)}
    dictionary_list.append(dictionary_data)
end_time = time.time()
print('Execution time for generation [list of dict (row store)] = %.6f seconds' % (end_time-start_time))

start_time = time.time()
df_final1 = pd.DataFrame.from_dict(dictionary_list)
end_time = time.time()
print('Execution time for conversion to pandas [list of dict (row store)] = %.6f seconds' % (end_time-start_time))

# Column store: build a dict of lists, one list per column
start_time = time.time()
dict_of_lists = {k: [] for k in range(30)}
for k in range(30):
    for _ in range(repetitions):
        dict_of_lists[k].append(random.random())
end_time = time.time()
print('Execution time for generation [dict of list (column store)] = %.6f seconds' % (end_time-start_time))

start_time = time.time()
df_final2 = pd.DataFrame(dict_of_lists)
end_time = time.time()
print('Execution time for conversion to pandas [dict of list (column store)] = %.6f seconds' % (end_time-start_time))

print(df_final1.shape)
print(df_final2.shape)

And here are the results:

Execution time for generation [list of dict (row store)] = 3.229172 seconds
Execution time for conversion to pandas [list of dict (row store)] = 3.649741 seconds
Execution time for generation [dict of list (column store)] = 4.030437 seconds
Execution time for conversion to pandas [dict of list (column store)] = 1.660122 seconds
(1000000, 30)
(1000000, 30)
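
If the data arrives row by row, as in the question, the rows can still be transposed into a column store before the DataFrame is constructed. A minimal sketch, assuming every row dict has the same keys:

from pandas import DataFrame

rows = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]  # row-oriented input

# transpose into one list per column; assumes all rows share the same keys
columns = {key: [row[key] for row in rows] for key in rows[0]}
df = DataFrame(columns)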

Upvotes: 1

Zaraki Kenpachi

Reputation: 5730

As Mohit Motwani suggested, the fastest way is to collect the data into a list of dictionaries and then load everything into a DataFrame at once. Below are some speed measurements:

import pandas as pd
import numpy as np
import time
import random

end_value = 10000

Measurement for creating a list of dictionaries and loading it all into a DataFrame at the end:

start_time = time.time()
dictionary_list = []
for i in range(0, end_value, 1):
    dictionary_data = {k: random.random() for k in range(30)}
    dictionary_list.append(dictionary_data)

df_final = pd.DataFrame.from_dict(dictionary_list)

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))

Execution time = 0.090153 seconds

Measurement for appending DataFrames to a list and concatenating them into one DataFrame:

start_time = time.time()
appended_data = []
for i in range(0, end_value, 1):
    data = pd.DataFrame(np.random.randint(0, 100, size=(1, 30)), columns=list('A'*30))  # 30 columns, all named 'A'
    appended_data.append(data)

appended_data = pd.concat(appended_data, axis=0)

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))

Execution time = 4.183921 seconds

Measurement for appending to a DataFrame row by row with DataFrame.append:

start_time = time.time()
df_final = pd.DataFrame()
for i in range(0, end_value, 1):
    df = pd.DataFrame(np.random.randint(0, 100, size=(1, 30)), columns=list('A'*30))
    df_final = df_final.append(df)

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))

Execution time = 11.085888 seconds

Measurement for inserting rows with loc:

start_time = time.time()
df = pd.DataFrame(columns=list('A'*30))
for i in range(0, end_value, 1):
    df.loc[i] = list(np.random.randint(0, 100, size=30))


end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))

Execution time = 21.029176 seconds
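
If the final number of rows is known up front, preallocating a NumPy buffer and filling it in place is another option worth benchmarking. This is only a sketch under that assumption, not one of the measurements above:

start_time = time.time()
buffer = np.empty((end_value, 30))
for i in range(0, end_value, 1):
    buffer[i] = np.random.randint(0, 100, size=30)  # write one row into the preallocated block

df_final = pd.DataFrame(buffer, columns=list('A'*30))

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))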

Upvotes: 79
