dsl1990

Reputation: 1209

Pandas error while reading 14 GB csv file on 200 GB RAM workstation

This is my code to generate one file per home id. I will then analyze each home separately.

import numpy as np
import pandas as pd

data = pd.read_csv("110homes.csv")
# Write one CSV per home, named after its dataid.
for i in np.unique(data['dataid']):
    print(i)
    d1 = data[data['dataid'] == i]
    d1.to_csv(str(i) + ".csv")

However, I get the error below. The machine has 200 GB of RAM, yet reading the file still raises a MemoryError:

    data = pd.read_csv("110homes.csv")
  File "/usr/lib/python2.7/site-packages/pandas/io/parsers.py", line 474, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python2.7/site-packages/pandas/io/parsers.py", line 260, in _read
    return parser.read()
  File "/usr/lib/python2.7/site-packages/pandas/io/parsers.py", line 721, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python2.7/site-packages/pandas/io/parsers.py", line 1170, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 769, in pandas.parser.TextReader.read (pandas/parser.c:7544)
  File "pandas/parser.pyx", line 819, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8137)
  File "pandas/parser.pyx", line 1833, in pandas.parser._concatenate_chunks (pandas/parser.c:22383)
MemoryError

Upvotes: 1

Views: 669

Answers (1)

SRobertJames

Reputation: 9263

Data in RAM can take a lot more space than it does on disk. Without seeing your 110homes.csv file, it's impossible to know the details, but imagine that it consists of 10 floating-point numbers per line, like: 0.0,1.0,2.0,.... In the CSV, each takes 3 bytes + 1 byte for the delimiter. In Python, each takes 8 bytes (on a 64-bit machine) for the float, plus 2 bytes per Unicode char (another 8 bytes), plus 8 bytes for the string length, plus 8 bytes per pointer, plus some bytes per row, etc.

Think about it like this: on a 64-bit machine, the minimum size for a pointer, a native int, or a native float is 8 bytes. You need several of those per field, and several more per row. It is not unusual for data to take 15x as much space in RAM as on disk.
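To get a feel for these numbers on your own interpreter, here is a minimal sketch that compares the on-disk cost of one field like 1.0 with the cost of holding it as a Python object. The 8-byte pointer is an assumption (64-bit build), and sys.getsizeof reports only the object itself, so treat the result as a rough lower bound rather than what pandas will actually allocate.

```python
import sys

# On-disk cost of one field such as "1.0": three characters plus a delimiter.
disk_bytes = len("1.0") + 1

# In-memory cost when the value is held as a Python object: the object
# itself plus an assumed 8-byte pointer to it (64-bit build).
POINTER_BYTES = 8
float_bytes = sys.getsizeof(1.0) + POINTER_BYTES
str_bytes = sys.getsizeof("1.0") + POINTER_BYTES

print("on disk:         %d bytes" % disk_bytes)
print("as float object: %d bytes (%.1fx)"
      % (float_bytes, float_bytes / float(disk_bytes)))
print("as str object:   %d bytes (%.1fx)"
      % (str_bytes, str_bytes / float(disk_bytes)))
```

The exact figures vary with the Python version. pandas stores purely numeric columns as packed 8-byte values, so the blow-up tends to be worst for columns that end up held as strings or other Python objects.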

Do a simple test: take the first 10% of the lines of your file and monitor Python via top as it processes them. See how much RAM it uses. Does it use at least 20 GB? If so, the full file would need around 200 GB or more, which is everything the machine has.
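If you would rather measure from inside Python than watch top, something like the following sketch may help. It is written for Python 3 and a reasonably recent pandas, and it swaps top for DataFrame.memory_usage(deep=True), which measures the finished DataFrame and not the parser's peak usage, so the real peak can be higher. The function name estimate_full_ram, the demo data, and every column name except dataid are made up for illustration.

```python
import io

import pandas as pd


def estimate_full_ram(source, nrows, total_bytes):
    """Read the first nrows rows and extrapolate RAM use to total_bytes of CSV."""
    sample = pd.read_csv(source, nrows=nrows)
    # deep=True also counts the Python objects behind object (string) columns.
    ram_bytes = int(sample.memory_usage(deep=True).sum())
    # Approximate the sample's on-disk size by serializing it back to CSV.
    disk_bytes = len(sample.to_csv(index=False).encode("utf-8"))
    ratio = ram_bytes / float(disk_bytes)
    return ratio, ratio * total_bytes


# Hypothetical stand-in for 110homes.csv; only "dataid" comes from the question.
rows = ["%d,2014-01-01 00:%02d:00,%.3f" % (i % 3, i % 60, i * 0.5)
        for i in range(1000)]
demo = io.StringIO("dataid,localminute,use\n" + "\n".join(rows))

ratio, projected = estimate_full_ram(demo, nrows=100, total_bytes=14 * 1024 ** 3)
print("RAM/disk ratio on the sample: %.1fx" % ratio)
print("projected RAM for a 14 GB file: %.1f GB" % (projected / 1024.0 ** 3))
```

For the real file you would pass the path "110homes.csv" as source and pick nrows to cover roughly 10% of the lines. The projection is only as good as the sample, so it assumes the first rows look like the rest of the file.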

Upvotes: 1
