Reputation: 603
I am running a python script which can be roughly summed (semi-psuedo-coded) as follows:
import pandas as pd

for json_file in json_files:
    # Read the file's JSON-lines records
    with open(json_file, 'r') as fin:
        data = fin.readlines()
    # Wrap the records in a JSON array so they parse as one DataFrame
    data_str = '[' + ','.join(x.strip() for x in data) + ']'
    df = pd.read_json(data_str)
    df.to_pickle('%s.pickle' % json_file)
    # Explicitly drop this iteration's references
    del df, data, data_str
The process works iteratively, creating a data frame for each file and saving it to its own pickle file. However, memory gets used up as the process runs, as if del df, data, data_str does not actually free it (originally I did not include the del statement at all, but I hoped adding it would resolve the issue -- it did not). Each iteration reads roughly the same amount of data into the data frame, about 3% of my available memory; with every iteration there is a corresponding 3% bump in %MEM (reported by ps u | grep [p]ython in my terminal), until eventually my memory is swamped and the process is killed. My question is: how should I change my code/approach so that the memory from each iteration is freed before the next one?
To note, I'm running Ubuntu 16.04 with Python 3.5.2 via Anaconda.
Thanks in advance for your direction.
Upvotes: 4
Views: 1124
Reputation: 314
In Python, automatic garbage collection deallocates objects once they are no longer referenced (a pandas DataFrame is just another object in this respect). There are different garbage collection strategies that can be tweaked, but that requires significant learning.
You can manually trigger garbage collection using
import gc
gc.collect()
But frequent calls to the garbage collector are discouraged, as collection is a costly operation and may affect performance.
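For example, a minimal sketch of how this could fit into the loop from your question (assuming json_files is defined as in your code):

import gc

import pandas as pd

for json_file in json_files:
    with open(json_file, 'r') as fin:
        data = fin.readlines()
    data_str = '[' + ','.join(x.strip() for x in data) + ']'
    df = pd.read_json(data_str)
    df.to_pickle('%s.pickle' % json_file)

    # Drop this iteration's references, then ask the collector to
    # reclaim anything still held in reference cycles before the
    # next file is processed.
    del df, data, data_str
    gc.collect()

Here the collection runs once per file rather than more frequently, which keeps the overhead modest relative to the cost of reading and pickling each file.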
Upvotes: 2