Reputation: 617
I have a large dataset that I have to convert to .csv format; it consists of 29 columns and 1M+ rows. I have noticed that appending rows to the DataFrame gets more and more time consuming as the frame grows. Is there a faster way to do this? I am sharing the relevant snippet from my code below.
Any other recommendations are welcome too.
import time

import requests
from pandas import DataFrame, concat, json_normalize

df = DataFrame()
for startID in range(0, 100000, 1000):
    s1 = time.time()
    tempdf = DataFrame()
    url = f'https://******/products?startId={startID}&size=1000'
    r = requests.get(url, headers={'****-Token': 'xxxxxx', 'Merchant-Id': '****'})
    jsonList = r.json()  # datatype = list, contains = dict
    normalized = json_normalize(jsonList)  # type(normalized) = pandas.DataFrame
    print(startID / 1000)  # status indicator
    for _, series in normalized.iterrows():  # iterrows returns (index, Series) tuples
        offers = series['offers']
        series = series.drop('offers')  # Series.drop takes labels; columns= is a no-op on a Series
        for offer in offers:
            n = json_normalize(offer).squeeze()  # squeeze() casts the one-row DataFrame into a Series
            concatenated = concat([series, n]).to_frame().transpose()
            tempdf = tempdf.append(concatenated, ignore_index=True)
    del normalized
    df = df.append(tempdf)  # this repeated append is what slows down as df grows
    f1 = time.time()
    print(f1 - s1, ' seconds')
df.to_csv('out.csv')
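For reference, a minimal sketch of the pattern the answers below recommend, applied to this loop: collect one plain dict per (product, offer) pair in a Python list and build the DataFrame once at the end. This assumes each product and each offer is a flat dict; if they are nested, pd.json_normalize(rows) can flatten the collected list in one call.

import requests
import pandas as pd

rows = []  # collect plain dicts; building the DataFrame once at the end avoids quadratic appends
for startID in range(0, 100000, 1000):
    url = f'https://******/products?startId={startID}&size=1000'
    r = requests.get(url, headers={'****-Token': 'xxxxxx', 'Merchant-Id': '****'})
    for product in r.json():
        offers = product.pop('offers')  # split the nested offers off the product dict
        for offer in offers:
            rows.append({**product, **offer})  # one flat dict per (product, offer) pair

df = pd.DataFrame(rows)  # single construction instead of repeated appends
df.to_csv('out.csv')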
Upvotes: 41
Views: 41816
Reputation: 613
There is an even more efficient way than the accepted answer: store the data by column instead of by row. This reduces both the memory and the time pandas needs to create the DataFrame.
Here is the benchmark code:
import pandas as pd
import time
import random

repetitions = 1000000

start_time = time.time()
dictionary_list = []
for _ in range(repetitions):
    dictionary_data = {k: random.random() for k in range(30)}
    dictionary_list.append(dictionary_data)
end_time = time.time()
print('Execution time for generation [list of dict (row store)] = %.6f seconds' % (end_time - start_time))

start_time = time.time()
df_final1 = pd.DataFrame.from_dict(dictionary_list)
end_time = time.time()
print('Execution time for conversion to pandas [list of dict (row store)] = %.6f seconds' % (end_time - start_time))

start_time = time.time()
list_dictionary = {k: [] for k in range(30)}
for k in range(30):
    for _ in range(repetitions):
        list_dictionary[k].append(random.random())
end_time = time.time()
print('Execution time for generation [dict of list (column store)] = %.6f seconds' % (end_time - start_time))

start_time = time.time()
df_final2 = pd.DataFrame(list_dictionary)
end_time = time.time()
print('Execution time for conversion to pandas [dict of list (column store)] = %.6f seconds' % (end_time - start_time))

print(df_final1.shape)
print(df_final2.shape)
And here are the results:
Execution time for generation [list of dict (row store)] = 3.229172 seconds
Execution time for conversion to pandas [list of dict (row store)] = 3.649741 seconds
Execution time for generation [dict of list (column store)] = 4.030437 seconds
Execution time for conversion to pandas [dict of list (column store)] = 1.660122 seconds
(1000000, 30)
(1000000, 30)
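The column-store idea also works when the data arrives record by record, as in the question. A minimal sketch, assuming 30 numeric fields per record (the record generator here is just a stand-in for incoming data):

import random
import pandas as pd

column_store = {k: [] for k in range(30)}  # one Python list per column
for _ in range(1000):
    record = {k: random.random() for k in range(30)}  # stand-in for an incoming record
    for k, value in record.items():
        column_store[k].append(value)  # append each field to its column's list

df = pd.DataFrame(column_store)  # cheap: each list becomes one column, no reshaping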
Upvotes: 1
Reputation: 5730
As Mohit Motwani suggested, the fastest way is to collect the data in a list of dictionaries and then load everything into a DataFrame at once. Below are some speed measurements:
import pandas as pd
import numpy as np
import time
import random
end_value = 10000
Measurement for creating a list of dictionaries and loading it all into a DataFrame at the end:
start_time = time.time()
dictionary_list = []
for i in range(end_value):
    dictionary_data = {k: random.random() for k in range(30)}
    dictionary_list.append(dictionary_data)
df_final = pd.DataFrame.from_dict(dictionary_list)
end_time = time.time()
print('Execution time = %.6f seconds' % (end_time - start_time))
Execution time = 0.090153 seconds
Measurement for appending DataFrames to a list and concatenating them at the end:
start_time = time.time()
appended_data = []
for i in range(end_value):
    data = pd.DataFrame(np.random.randint(0, 100, size=(1, 30)), columns=list('A' * 30))  # note: 30 columns, all named 'A'
    appended_data.append(data)
appended_data = pd.concat(appended_data, axis=0)
end_time = time.time()
print('Execution time = %.6f seconds' % (end_time - start_time))
Execution time = 4.183921 seconds
Measurement for appending DataFrames row by row:
start_time = time.time()
df_final = pd.DataFrame()
for i in range(end_value):
    df = pd.DataFrame(np.random.randint(0, 100, size=(1, 30)), columns=list('A' * 30))
    df_final = df_final.append(df)  # DataFrame.append is deprecated and was removed in pandas 2.0
end_time = time.time()
print('Execution time = %.6f seconds' % (end_time - start_time))
Execution time = 11.085888 seconds
Measurement for inserting rows with .loc:
start_time = time.time()
df = pd.DataFrame(columns=list('A' * 30))
for i in range(end_value):
    df.loc[i] = list(np.random.randint(0, 100, size=30))
end_time = time.time()
print('Execution time = %.6f seconds' % (end_time - start_time))
Execution time = 21.029176 seconds
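To summarize the four measurements: the list-of-dicts approach (~0.09 s) is one to two orders of magnitude faster than concat of per-row frames (~4.2 s), row-by-row append (~11.1 s), and .loc insertion (~21.0 s). For completeness, the fastest pattern also collapses to a one-liner; a minimal sketch reusing end_value and the same random data as the measurements above:

df_final = pd.DataFrame([{k: random.random() for k in range(30)} for _ in range(end_value)])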
Upvotes: 79