Reputation: 200
I have been trying to read a 3.2 GB file into memory using pandas' read_csv function, but I kept running into what looked like some sort of memory leak: my memory usage would spike to 90%+.
As alternatives:
I tried defining dtype to avoid keeping the data in memory as strings, but saw similar behaviour.
I tried numpy's genfromtxt, thinking I would get different results, but I was definitely wrong about that.
I tried reading line by line, which ran into the same problem, only much more slowly.
I recently moved to Python 3, so I thought there could be a bug there, but I saw similar results with Python 2 + pandas.
The file in question is train.csv from the Kaggle Grupo Bimbo competition.
System info:
RAM: 16 GB, Processor: i7, 8 cores
Let me know if you would like to know anything else.
Thanks :)
EDIT 1: It's a memory spike, not a leak (sorry, my bad).
EDIT 2: Sample of the csv file
Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Venta_uni_hoy,Venta_hoy,Dev_uni_proxima,Dev_proxima,Demanda_uni_equil
3,1110,7,3301,15766,1212,3,25.14,0,0.0,3
3,1110,7,3301,15766,1216,4,33.52,0,0.0,4
3,1110,7,3301,15766,1238,4,39.32,0,0.0,4
3,1110,7,3301,15766,1240,4,33.52,0,0.0,4
3,1110,7,3301,15766,1242,3,22.92,0,0.0,3
EDIT 3: Number of rows in the file: 74,180,465
Other than a simple pd.read_csv('filename', low_memory=False)
I have tried
from numpy import genfromtxt
my_data = genfromtxt('data/train.csv', delimiter=',')
UPDATE: The code below just worked, but I still want to get to the bottom of this problem; there must be something wrong.
import pandas as pd
import gc

data = pd.DataFrame()
data_iterator = pd.read_csv('data/train.csv', chunksize=100000)

for sub_data in data_iterator:
    data = data.append(sub_data)  # append returns a new frame, so the result has to be assigned back
    gc.collect()
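For reference, the same chunked read is usually written by collecting the chunks in a list and concatenating once at the end, which avoids re-copying the growing frame on every iteration (a sketch, not something I have profiled on the full file):

import pandas as pd

chunks = []
for sub_data in pd.read_csv('data/train.csv', chunksize=100000):
    chunks.append(sub_data)  # list append is cheap; no DataFrame copy happens here
data = pd.concat(chunks, ignore_index=True)  # a single concatenation at the end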
EDIT: Piece of code that worked. Thanks for all the help, everyone. I had messed up my dtypes by using Python dtypes instead of numpy ones. Once I fixed that, the code below worked like a charm.
import numpy as np
import pandas as pd

# numpy dtypes, not the Python built-ins. int8 only covers -128..127, so the
# ID columns are widened enough to hold the values visible in the sample rows
# above; if read_csv complains about an overflow for any other column, widen
# that column's type as well.
dtypes = {'Semana': np.int8,
          'Agencia_ID': np.int16,
          'Canal_ID': np.int8,
          'Ruta_SAK': np.int16,
          'Cliente_ID': np.int32,
          'Producto_ID': np.int32,
          'Venta_uni_hoy': np.int16,
          'Venta_hoy': np.float32,
          'Dev_uni_proxima': np.int16,
          'Dev_proxima': np.float32,
          'Demanda_uni_equil': np.int16}

data = pd.read_csv('data/train.csv', dtype=dtypes)
This brought the memory consumption down to just under 4 GB.
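As a quick check of the loaded frame's actual footprint (standard pandas, nothing specific to this dataset):

print(data.memory_usage(deep=True).sum() / 1024**3)  # total size of the DataFrame, in GiB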
Upvotes: 6
Views: 3219
Reputation: 61289
Based on your second chart, it looks as though there's a brief period where your machine allocates an additional 4.368 GB of memory, which is approximately the size of your 3.2 GB dataset (assuming ~1 GB of overhead, which might be a stretch).
I tried to track down a place where this could happen and haven't been super successful. Perhaps you can find it, though, if you're motivated. Here's the path I took:
This line reads:
def read(self, nrows=None):
    if nrows is not None:
        if self.options.get('skip_footer'):
            raise ValueError('skip_footer not supported for iteration')

    ret = self._engine.read(nrows)
Here, _engine references PythonParser.
That, in turn, calls _get_lines().
That makes calls to a data source, which looks like it reads in data in the form of strings from something relatively standard (see here), like TextIOWrapper.
So things are getting read in as standard text and then converted, which explains the slow ramp.
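If you want to test whether the pure-Python parser is part of what you're seeing, one option is to force each engine explicitly and watch memory while each runs (nrows just keeps the experiment small; this is a sketch, not something I've profiled on your file):

import pandas as pd

# engine='c' is the default fast path; engine='python' goes through the
# PythonParser / _get_lines() route described above.
df_c = pd.read_csv('data/train.csv', engine='c', nrows=1000000)
df_py = pd.read_csv('data/train.csv', engine='python', nrows=1000000)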
What about the spike? I think that's explained by these lines:
ret = self._engine.read(nrows)

if self.options.get('as_recarray'):
    return ret

# May alter columns / col_dict
index, columns, col_dict = self._create_index(ret)

df = DataFrame(col_dict, columns=columns, index=index)
ret becomes all the components of a data frame.
self._create_index() breaks ret apart into those components:
def _create_index(self, ret):
    index, columns, col_dict = ret
    return index, columns, col_dict
So far, everything can be done by reference, and the call to DataFrame()
continues that trend (see here).
So, if my theory is correct, DataFrame()
is either copying the data somewhere, or _engine.read()
is doing so somewhere along the path I've identified.
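One way to probe that last step is to build a frame from an existing array and check whether the constructor kept a reference or made its own copy (a quick sketch; the column name and size are arbitrary):

import numpy as np
import pandas as pd

col = np.arange(1000000, dtype=np.int64)
df = pd.DataFrame({'col': col})

# False here means DataFrame() copied the data into its own internal blocks.
print(np.shares_memory(col, df['col'].values))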
Upvotes: 1
Reputation: 11075
A file read into memory as text is not as compact as a compressed binary format, but it is still relatively compact byte-for-byte. If it's a simple ASCII file, aside from any file header information, each character is only 1 byte. Python strings have a similar relationship: there's some overhead for internal Python bookkeeping, but each extra character adds only 1 byte (from testing with __sizeof__). Once you start converting to numeric types and collections (lists, arrays, data frames, etc.), the overhead grows. A list, for example, must store a type and a value for each position, whereas a string only stores the values.
>>> s = '3,1110,7,3301,15766,1212,3,25.14,0,0.0,3\r\n'
>>> l = [3,1110,7,3301,15766,1212,3,25.14,0,0.0,3]
>>> s.__sizeof__()
75
>>> l.__sizeof__()
128
A little bit of testing (assuming __sizeof__ is accurate):
import numpy as np
import pandas as pd
s = '1,2,3,4,5,6,7,8,9,10'
print ('string: '+str(s.__sizeof__())+'\n')
l = [1,2,3,4,5,6,7,8,9,10]
print ('list: '+str(l.__sizeof__())+'\n')
a = np.array([1,2,3,4,5,6,7,8,9,10])
print ('array: '+str(a.__sizeof__())+'\n')
b = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.dtype('u1'))
print ('byte array: '+str(b.__sizeof__())+'\n')
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10])
print ('dataframe: '+str(df.__sizeof__())+'\n')
returns:
string: 53
list: 120
array: 136
byte array: 106
dataframe: 152
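Scaled to the 74,180,465 rows from the question, the per-value width is what dominates; a rough back-of-the-envelope comparison of the file as raw text versus eleven default-width (8-byte) numeric columns, ignoring the index and any parsing overhead:

import numpy as np

n_rows = 74180465  # row count from the question
sample = '3,1110,7,3301,15766,1212,3,25.14,0,0.0,3\r\n'  # one row of the file, as text

text_gb = len(sample) * n_rows / 1024**3                        # raw text, assuming rows of similar length
int64_gb = 11 * np.dtype(np.int64).itemsize * n_rows / 1024**3  # 11 columns at pandas' default 8-byte width

print('raw text   : %.1f GB' % text_gb)    # in the ballpark of the 3.2 GB file
print('11 x int64 : %.1f GB' % int64_gb)   # why a default parse peaks well above the file size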
Upvotes: 2