Reputation: 313
I am currently modifying a pandas dataframe in a loop structure which looks something like this:
for item in item_list:
~~~~ do something to the item ~~~~~
results_df = results_df.append(item)
This code is fine for small items being appended and whenever the results_df is small. However, the items I am appending are reasonably large, and the loop is quite long, which means this loop takes quite a long time to complete due to the large expense of copying the result_df when it becomes large.
One solution I can see is that I could append items to a list in this dictionary, like:
results_dict = {'result_1': [], 'result_2': [], 'result_3': []}
for item in item_list:
item_1, item_2, item_3 = item
~~~~~ do something ~~~~
results_dict['result_1'].append(item_1)
results_dict['result_2'].append(item_2)
results_dict['result_3'].append(item_3)
From the resulting dictionary the dataframe can then be made. This is ok but does not seem optimal. Can anyone think of a better solution? Nb the items in each item in item_list are reasonably large dataframe on which some comoplex processing takes place, and the length of item_list is of the order of 1000
Upvotes: 0
Views: 196
Reputation: 20445
Although you are doing same with the dictionary, according to my understanding, with appending to list as dictionary key-value, you have additional complexity of O(1) for dictionary lookup, for each iteration.
you can make list of or each columns (items in you case) and make a dataframe from these lists
item_1_list = []
item_2_list = []
item_3_list = []
for item in item_list:
item_1, item_2, item_3 = item
item_1_list.append(item_1)
item_2_list.append(item_2)
item_3_list.append(item_3)
df = pd.DataFrame({'item_1': item_1_list, 'item_2': item_2_list,'item_3': item_3_list})
del item_1_list,item_2_list,item_3_list
Although dictionary lookup of O(1) doesnt matter much, but I think you can still be better off with list.
Here are the benchmarks
for your approach
import timeit
start = timeit.default_timer()
results_dict = {'result_1': [], 'result_2': [], 'result_3': []}
for item in range(1000):
results_dict['result_1'].append(item)
results_dict['result_2'].append(item)
results_dict['result_3'].append(item)
df = pd.DataFrame(results_dict)
stop = timeit.default_timer()
print('Time: ', stop - start)
Time it took:
Time: 0.013144109999984721
With this approach
import timeit
start = timeit.default_timer()
item_1_list = []
item_2_list = []
item_3_list = []
for item in range(1000):
item_1_list.append(item)
item_2_list.append(item)
item_3_list.append(item)
df = pd.DataFrame({'item_1': item_1_list, 'item_2': item_2_list,'item_3': item_3_list})
stop = timeit.default_timer()
print('Time: ', stop - start)
Time it took:
Time: 0.005675986999960969
Upvotes: 2