Reputation: 99
Take the code below for example. Assuming the chunk iterator yields 10 chunks, will the for loop load all of them into memory (one after another), or will Python work efficiently, releasing each one before loading the next?
df_iter = pd.read_csv(file, chunksize=100)
for chunk in df_iter:
    chunk.to_sql(table, engine)
I've run some tests with files bigger than memory using the code above and got a memory overflow. Am I missing something here?
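For reference, a minimal check of what the loop is actually iterating over (a sketch only; file is the same placeholder as above, and the exact class path may differ between pandas versions):

import pandas as pd

# With chunksize set, read_csv does not return a DataFrame but a
# TextFileReader, which yields one DataFrame per chunk on demand.
reader = pd.read_csv(file, chunksize=100)
print(type(reader))         # e.g. <class 'pandas.io.parsers.readers.TextFileReader'>

first_chunk = next(reader)  # only this chunk has been parsed so far
print(len(first_chunk))     # at most 100 rows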
Upvotes: 1
Views: 594
Reputation: 99
The correct syntax is:
for chunk in pd.read_csv(file, chunksize=100):
    # do something with chunk
And not:
chunks = pd.read_csv(file, chunksize=100)
for chunk in chunks:
    # do something with chunk
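A concrete, runnable version of the suggested form, reusing the placeholders from the question (the chunksize, if_exists and index arguments here are my additions for illustration, not part of the original snippet):

import pandas as pd

# Iterate over the reader directly; each chunk is written out and the
# previous chunk's DataFrame becomes eligible for garbage collection
# once the name 'chunk' is rebound on the next iteration.
for chunk in pd.read_csv(file, chunksize=100):
    chunk.to_sql(table, engine, if_exists='append', index=False)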
Upvotes: 0
Reputation: 4548
I think I'm seeing what you're seeing, where more and more memory gets used by the program as the loop iterates. I wasn't expecting this to be the case. I tried keeping track of the current memory using the tracemalloc library, and the memory usage does increase.
I tried to pre-allocate all the memory I'd need outside of the for-loop so that no accidental memory accumulation occurred, but I could have made a mistake somehow.
import pandas as pd
import numpy as np
# followed example on https://www.geeksforgeeks.org/monitoring-memory-usage-of-a-running-python-program/
import tracemalloc

# create an example csv to read back in with chunks
nrows = 1000000
out_df = pd.DataFrame({
    'age': np.random.randint(0, 10, nrows),
    'height': np.random.randint(0, 10, nrows),
})
out_df.to_csv('test_out.csv')

chunksize = 10000
pre_alloc_data = {i: 0 for i in range(0, nrows // chunksize, 10)}

# starting the monitoring
tracemalloc.start()

df_iter = pd.read_csv('test_out.csv', chunksize=chunksize)
for i, chunk in enumerate(df_iter):
    # store the current memory usage every 10 iterations
    if i % 10 == 0:
        pre_alloc_data[i] = tracemalloc.get_traced_memory()[0]

# stopping the library
tracemalloc.stop()

print(pre_alloc_data)
Output
{0: 1047984, 10: 1049208, 20: 1049532, 30: 1049832, 40: 1050596, 50: 1051720, 60: 1052692, 70: 1053696, 80: 1054732, 90: 1055800}
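For completeness, tracemalloc.get_traced_memory() returns a (current, peak) tuple, so a small variant of the same loop (a sketch, not part of the test above) can record the peak usage as well:

import pandas as pd
import tracemalloc

# Same measurement loop, but keeping both the current and the peak
# traced allocation sizes at every 10th chunk.
tracemalloc.start()
usage = {}
for i, chunk in enumerate(pd.read_csv('test_out.csv', chunksize=10000)):
    if i % 10 == 0:
        usage[i] = tracemalloc.get_traced_memory()  # (current, peak)
tracemalloc.stop()
print(usage)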
Upvotes: 1