Reputation: 99
Take the code below for example. Assuming the chunk iterator yields 10 chunks, will the for loop load all of them into memory (one after another), or will Python work efficiently, releasing each one before loading the next?
df_iter = pd.read_csv(file, chunksize=100)
for chunk in df_iter:
    chunk.to_sql(table, engine)
I've run some tests with files bigger than memory using the code above and got a memory overflow. Am I missing something here?
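For reference, a minimal check of what the loop is actually iterating over (a sketch only; file is the same placeholder as above, and the exact class path may differ between pandas versions):

import pandas as pd

# With chunksize set, read_csv does not return a DataFrame but a
# TextFileReader, which yields one DataFrame per chunk on demand.
reader = pd.read_csv(file, chunksize=100)
print(type(reader))         # e.g. <class 'pandas.io.parsers.readers.TextFileReader'>

first_chunk = next(reader)  # only this chunk has been parsed so far
print(len(first_chunk))     # at most 100 rows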
Upvotes: 1
Views: 594
Reputation: 99
The correct syntax is:
for chunk in pd.read_csv(file, chunksize=100):
    # do something with chunk
And not:
chunks = pd.read_csv(file, chunksize=100)
for chunk in chunks:
    # do something with chunk
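A concrete, runnable version of the suggested form, reusing the placeholders from the question (the chunksize, if_exists and index arguments here are my additions for illustration, not part of the original snippet):

import pandas as pd

# Iterate over the reader directly; each chunk is written out and the
# previous chunk's DataFrame becomes eligible for garbage collection
# once the name 'chunk' is rebound on the next iteration.
for chunk in pd.read_csv(file, chunksize=100):
    chunk.to_sql(table, engine, if_exists='append', index=False)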
Upvotes: 0
Reputation: 4548
I think I'm seeing what you're seeing, where more and more memory gets used by the program as the loop iterates. I wasn't expecting this to be the case. I tried keeping track of the current memory using the tracemalloc library, and the memory usage does increase.
I tried to pre-allocate all the memory I'd need outside of the for-loop so that no accidental memory accumulation occurred, but I could have made a mistake somehow.
import pandas as pd
import numpy as np
# followed example on https://www.geeksforgeeks.org/monitoring-memory-usage-of-a-running-python-program/
import tracemalloc

# create an example csv to read back in with chunks
nrows = 1000000
out_df = pd.DataFrame({
    'age': np.random.randint(0, 10, nrows),
    'height': np.random.randint(0, 10, nrows),
})
out_df.to_csv('test_out.csv')

chunksize = 10000
pre_alloc_data = {i: 0 for i in range(0, nrows // chunksize, 10)}

# starting the monitoring
tracemalloc.start()

df_iter = pd.read_csv('test_out.csv', chunksize=chunksize)
for i, chunk in enumerate(df_iter):
    # store the current memory usage every 10 iterations
    if i % 10 == 0:
        pre_alloc_data[i] = tracemalloc.get_traced_memory()[0]

# stopping the library
tracemalloc.stop()

print(pre_alloc_data)
Output
{0: 1047984, 10: 1049208, 20: 1049532, 30: 1049832, 40: 1050596, 50: 1051720, 60: 1052692, 70: 1053696, 80: 1054732, 90: 1055800}
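For completeness, tracemalloc.get_traced_memory() returns a (current, peak) tuple, so a small variant of the same loop (a sketch, not part of the test above) can record the peak usage as well:

import pandas as pd
import tracemalloc

# Same measurement loop, but keeping both the current and the peak
# traced allocation sizes at every 10th chunk.
tracemalloc.start()
usage = {}
for i, chunk in enumerate(pd.read_csv('test_out.csv', chunksize=10000)):
    if i % 10 == 0:
        usage[i] = tracemalloc.get_traced_memory()  # (current, peak)
tracemalloc.stop()
print(usage)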
Upvotes: 1