Reputation: 200
I have been trying to read a 3.2 GB file into memory using pandas' read_csv function, but I kept running into what looked like some sort of memory leak: my memory usage would spike to 90%+.
As alternatives:
I tried defining dtype to avoid keeping the data in memory as strings, but saw similar behaviour.
I tried numpy's genfromtxt, thinking I would get different results, but I was definitely wrong about that.
I tried reading line by line, which ran into the same problem, only much more slowly.
I recently moved to Python 3, so I thought there could be a bug there, but I saw similar results with Python 2 + pandas.
The file in question is train.csv from the Kaggle Grupo Bimbo competition.
System info:
RAM: 16 GB, Processor: i7, 8 cores
Let me know if you would like to know anything else.
Thanks :)
EDIT 1: It's a memory spike, not a leak (sorry, my bad).
EDIT 2: Sample of the csv file
Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Venta_uni_hoy,Venta_hoy,Dev_uni_proxima,Dev_proxima,Demanda_uni_equil
3,1110,7,3301,15766,1212,3,25.14,0,0.0,3
3,1110,7,3301,15766,1216,4,33.52,0,0.0,4
3,1110,7,3301,15766,1238,4,39.32,0,0.0,4
3,1110,7,3301,15766,1240,4,33.52,0,0.0,4
3,1110,7,3301,15766,1242,3,22.92,0,0.0,3
EDIT 3: Number of rows in the file: 74,180,465
Other than a simple pd.read_csv('filename', low_memory=False)
I have tried
from numpy import genfromtxt
my_data = genfromtxt('data/train.csv', delimiter=',')
UPDATE: The code below just worked, but I still want to get to the bottom of this problem; there must be something wrong.
import pandas as pd
import gc

data = pd.DataFrame()
data_iterator = pd.read_csv('data/train.csv', chunksize=100000)

for sub_data in data_iterator:
    data = data.append(sub_data)  # append returns a new frame, so the result has to be assigned back
    gc.collect()
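For reference, the same chunked read is usually written by collecting the chunks in a list and concatenating once at the end, which avoids re-copying the growing frame on every iteration (a sketch, not something I have profiled on the full file):

import pandas as pd

chunks = []
for sub_data in pd.read_csv('data/train.csv', chunksize=100000):
    chunks.append(sub_data)  # list append is cheap; no DataFrame copy happens here
data = pd.concat(chunks, ignore_index=True)  # a single concatenation at the end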
EDIT: Piece of code that worked. Thanks for all the help, everyone. I had messed up my dtypes by using Python dtypes instead of numpy ones. Once I fixed that, the code below worked like a charm.
import numpy as np
import pandas as pd

# numpy dtypes, not the Python built-ins. int8 only covers -128..127, so the
# ID columns are widened enough to hold the values visible in the sample rows
# above; if read_csv complains about an overflow for any other column, widen
# that column's type as well.
dtypes = {'Semana': np.int8,
          'Agencia_ID': np.int16,
          'Canal_ID': np.int8,
          'Ruta_SAK': np.int16,
          'Cliente_ID': np.int32,
          'Producto_ID': np.int32,
          'Venta_uni_hoy': np.int16,
          'Venta_hoy': np.float32,
          'Dev_uni_proxima': np.int16,
          'Dev_proxima': np.float32,
          'Demanda_uni_equil': np.int16}

data = pd.read_csv('data/train.csv', dtype=dtypes)
This brought the memory consumption down to just under 4 GB.
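As a quick check of the loaded frame's actual footprint (standard pandas, nothing specific to this dataset):

print(data.memory_usage(deep=True).sum() / 1024**3)  # total size of the DataFrame, in GiB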
Upvotes: 6
Views: 3219
Reputation: 61289
Based on your second chart, it looks as though there's a brief period where your machine allocates an additional 4.368 GB of memory, which is approximately the size of your 3.2 GB dataset (assuming ~1 GB of overhead, which might be a stretch).
I tried to track down a place where this could happen and haven't been super successful. Perhaps you can find it, though, if you're motivated. Here's the path I took:
This line reads:
def read(self, nrows=None):
    if nrows is not None:
        if self.options.get('skip_footer'):
            raise ValueError('skip_footer not supported for iteration')

    ret = self._engine.read(nrows)
Here, _engine references PythonParser.
That, in turn, calls _get_lines().
That makes calls to a data source, which looks like it reads in data in the form of strings from something relatively standard (see here), like TextIOWrapper.
So things are getting read in as standard text and then converted, which explains the slow ramp.
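If you want to test whether the pure-Python parser is part of what you're seeing, one option is to force each engine explicitly and watch memory while each runs (nrows just keeps the experiment small; this is a sketch, not something I've profiled on your file):

import pandas as pd

# engine='c' is the default fast path; engine='python' goes through the
# PythonParser / _get_lines() route described above.
df_c = pd.read_csv('data/train.csv', engine='c', nrows=1000000)
df_py = pd.read_csv('data/train.csv', engine='python', nrows=1000000)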
What about the spike? I think that's explained by these lines:
ret = self._engine.read(nrows)

if self.options.get('as_recarray'):
    return ret

# May alter columns / col_dict
index, columns, col_dict = self._create_index(ret)

df = DataFrame(col_dict, columns=columns, index=index)
ret becomes all the components of a data frame.
self._create_index() breaks ret apart into those components:
def _create_index(self, ret):
    index, columns, col_dict = ret
    return index, columns, col_dict
So far, everything can be done by reference, and the call to DataFrame()
continues that trend (see here).
So, if my theory is correct, DataFrame()
is either copying the data somewhere, or _engine.read()
is doing so somewhere along the path I've identified.
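One way to probe that last step is to build a frame from an existing array and check whether the constructor kept a reference or made its own copy (a quick sketch; the column name and size are arbitrary):

import numpy as np
import pandas as pd

col = np.arange(1000000, dtype=np.int64)
df = pd.DataFrame({'col': col})

# False here means DataFrame() copied the data into its own internal blocks.
print(np.shares_memory(col, df['col'].values))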
Upvotes: 1
Reputation: 11075
A file read into memory as text is not as compact as a compressed binary format, but it is still relatively compact byte-for-byte. If it's a simple ASCII file, aside from any file header information, each character is only 1 byte. Python strings have a similar relationship: there's some overhead for internal Python bookkeeping, but each extra character adds only 1 byte (from testing with __sizeof__). Once you start converting to numeric types and collections (lists, arrays, data frames, etc.), the overhead grows. A list, for example, must store a type and a value for each position, whereas a string only stores the values.
>>> s = '3,1110,7,3301,15766,1212,3,25.14,0,0.0,3\r\n'
>>> l = [3,1110,7,3301,15766,1212,3,25.14,0,0.0,3]
>>> s.__sizeof__()
75
>>> l.__sizeof__()
128
A little bit of testing (assuming __sizeof__ is accurate):
import numpy as np
import pandas as pd
s = '1,2,3,4,5,6,7,8,9,10'
print ('string: '+str(s.__sizeof__())+'\n')
l = [1,2,3,4,5,6,7,8,9,10]
print ('list: '+str(l.__sizeof__())+'\n')
a = np.array([1,2,3,4,5,6,7,8,9,10])
print ('array: '+str(a.__sizeof__())+'\n')
b = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.dtype('u1'))
print ('byte array: '+str(b.__sizeof__())+'\n')
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10])
print ('dataframe: '+str(df.__sizeof__())+'\n')
returns:
string: 53
list: 120
array: 136
byte array: 106
dataframe: 152
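Scaled to the 74,180,465 rows from the question, the per-value width is what dominates; a rough back-of-the-envelope comparison of the file as raw text versus eleven default-width (8-byte) numeric columns, ignoring the index and any parsing overhead:

import numpy as np

n_rows = 74180465  # row count from the question
sample = '3,1110,7,3301,15766,1212,3,25.14,0,0.0,3\r\n'  # one row of the file, as text

text_gb = len(sample) * n_rows / 1024**3                        # raw text, assuming rows of similar length
int64_gb = 11 * np.dtype(np.int64).itemsize * n_rows / 1024**3  # 11 columns at pandas' default 8-byte width

print('raw text   : %.1f GB' % text_gb)    # in the ballpark of the 3.2 GB file
print('11 x int64 : %.1f GB' % int64_gb)   # why a default parse peaks well above the file size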
Upvotes: 2