tam63
tam63

Reputation: 313

Faster method to append in pandas

I am currently modifying a pandas dataframe in a loop structure which looks something like this:

for item in item_list:
    
    ~~~~ do something to the item ~~~~~

    results_df = results_df.append(item)

This code is fine for small items being appended and whenever the results_df is small. However, the items I am appending are reasonably large, and the loop is quite long, which means this loop takes quite a long time to complete due to the large expense of copying the result_df when it becomes large.

One solution I can see is that I could append items to a list in this dictionary, like:

results_dict = {'result_1': [], 'result_2': [], 'result_3': []}
for item in item_list:
    item_1, item_2, item_3 = item

    ~~~~~ do something ~~~~

    results_dict['result_1'].append(item_1)
    results_dict['result_2'].append(item_2)
    results_dict['result_3'].append(item_3)

From the resulting dictionary the dataframe can then be made. This is ok but does not seem optimal. Can anyone think of a better solution? Nb the items in each item in item_list are reasonably large dataframe on which some comoplex processing takes place, and the length of item_list is of the order of 1000

Upvotes: 0

Views: 196

Answers (1)

A.B
A.B

Reputation: 20445

Although you are doing same with the dictionary, according to my understanding, with appending to list as dictionary key-value, you have additional complexity of O(1) for dictionary lookup, for each iteration.

you can make list of or each columns (items in you case) and make a dataframe from these lists

item_1_list = []
item_2_list = []
item_3_list = []

for item in item_list:
    item_1, item_2, item_3 = item
    
    item_1_list.append(item_1)
    item_2_list.append(item_2)
    item_3_list.append(item_3)

df = pd.DataFrame({'item_1': item_1_list, 'item_2': item_2_list,'item_3': item_3_list})
del item_1_list,item_2_list,item_3_list

Although dictionary lookup of O(1) doesnt matter much, but I think you can still be better off with list.

Here are the benchmarks

for your approach

import timeit

start = timeit.default_timer()

results_dict = {'result_1': [], 'result_2': [], 'result_3': []}
for item in range(1000):
    

    

    results_dict['result_1'].append(item)
    results_dict['result_2'].append(item)
    results_dict['result_3'].append(item)
df = pd.DataFrame(results_dict)
stop = timeit.default_timer()

print('Time: ', stop - start) 

Time it took:

Time:  0.013144109999984721

With this approach

import timeit

start = timeit.default_timer()

item_1_list = []
item_2_list = []
item_3_list = []

for item in range(1000):
    
    
    item_1_list.append(item)
    item_2_list.append(item)
    item_3_list.append(item)

df = pd.DataFrame({'item_1': item_1_list, 'item_2': item_2_list,'item_3': item_3_list})


stop = timeit.default_timer()

print('Time: ', stop - start)  

Time it took:

Time:  0.005675986999960969

Upvotes: 2

Related Questions