Carefree

Reputation: 25

Read large csv files with Python

I used Dask to read a 2.5 GB CSV file and Python gave me errors. This is the code I wrote:

import pandas as pd
import numpy as np
import time
from dask import dataframe as df1

s_time_dask = time.time()
dask_df = df1.read_csv('3SPACK_N150_7Ah_PressureDistributionStudy_Data_Matrix.csv')
e_time_dask = time.time()

The following is the error I got from Python:

dask_df = df1.read_csv('3SPACK_N150_7Ah_PressureDistributionStudy_Data_Matrix.csv')

  File "C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\io\csv.py", line 645, in read
    return read_pandas(
  File "C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\io\csv.py", line 525, in read_pandas
    head = reader(BytesIO(b_sample), **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 686, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 458, in _read
    data = parser.read(nrows)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1196, in read
    ret = self._engine.read(nrows)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2155, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 918, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 905, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 2042, in pandas._libs.parsers.raise_parser_error

ParserError: Error tokenizing data. C error: Expected 1 fields in line 43, saw 9

Can you please help me with this problem?

Thanks

Upvotes: 1

Views: 1398

Answers (3)

Odhian

Reputation: 375

If you really need to load all of your data, you can do it in chunks so it doesn't take up all your memory: read_csv() has a parameter called chunksize. You can see how it works at kite.com.

You can also check the pandas documentation.
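For reference, a minimal sketch of chunked reading, assuming the filename from the question; the chunk size and the row count are placeholder examples:

import pandas as pd

total_rows = 0
# chunksize controls how many rows are parsed per chunk; tune it to your available memory.
for chunk in pd.read_csv('3SPACK_N150_7Ah_PressureDistributionStudy_Data_Matrix.csv',
                         chunksize=100_000):
    # Process each chunk here: filter columns, aggregate, write out, etc.
    total_rows += len(chunk)

print(total_rows)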

Upvotes: 0

mdurant

Reputation: 28684

Your error has nothing to do with memory. Dask loads text files such as CSVs chunk-wise, by choosing fixed byte offsets and then scanning from each offset to the nearest newline character. This is so that you can access the same file from multiple processes or even multiple machines, and only work on as many chunks as you have worker threads at a time.

Unfortunately, a newline character doesn't always mean the end of a row, since one can occur inside a quoted text field. This means that you essentially cannot read the file with dask's read_csv unless you preemptively find a set of byte offsets that guarantees clean partitioning without breaking in the middle of a quoted string.
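To illustrate the point (this example is mine, not part of the answer), here is a tiny made-up CSV where a quoted field contains a newline, so splitting the raw text at newlines would cut the first row in half:

import pandas as pd
from io import StringIO

# Hypothetical two-row CSV: the "comment" field of the first row contains
# a newline inside quotes, so the byte after that newline is not a row start.
data = 'id,comment\n1,"first line\nsecond line"\n2,"plain"\n'

# Parsing the whole text at once handles the quoted newline correctly...
print(pd.read_csv(StringIO(data)))

# ...but splitting on raw newlines (which is how block boundaries are found)
# produces a malformed "row" that starts with: second line"
print(data.split('\n'))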

Upvotes: 1

Wes Hardaker

Reputation: 22262

In short: you're out of memory. You're trying to load more data into Python than can fit in your machine's memory (Python's memory usage is higher than C/C++/etc., but you'd still hit a limit with those languages too).

To fix this, you probably need to read the file with the csv module's reader instead, which lets you process it line by line. Then process each line to keep only the columns you want, or do whatever aggregation you need on a line-by-line basis. If you can't do this, then you either need to use a smaller dataset (if you really do need all of the data in memory at once) or a machine with more memory.
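A minimal sketch of that approach, assuming the filename from the question; the column index and the running sum are made-up examples:

import csv

total = 0.0
with open('3SPACK_N150_7Ah_PressureDistributionStudy_Data_Matrix.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)        # skip the header row
    for row in reader:
        # Keep or aggregate only what you need; here we sum a single column.
        total += float(row[1])

print(total)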

If your file is 2.5 GB, it wouldn't surprise me if your system needed ~20 GB of memory or so to hold it. But the right way to estimate is to load a fixed number of rows, check how much memory your process is using, then read twice that number of rows and look at the memory usage again. Subtract the lower number from the higher and that's roughly how much memory you need to hold that many rows. You can then extrapolate to how much you need for all the data.
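A rough sketch of that estimate, using pandas' own memory accounting rather than watching the whole process (the filename and sample size are assumptions, and the file has to parse cleanly for this to run):

import pandas as pd

path = '3SPACK_N150_7Ah_PressureDistributionStudy_Data_Matrix.csv'
n = 100_000  # arbitrary sample size

small = pd.read_csv(path, nrows=n)
large = pd.read_csv(path, nrows=2 * n)

mb_small = small.memory_usage(deep=True).sum() / 1e6
mb_large = large.memory_usage(deep=True).sum() / 1e6

# The difference approximates the memory needed for n rows of data;
# scale by (total_rows / n) to estimate the cost of loading the whole file.
print(f"~{mb_large - mb_small:.1f} MB per {n} rows")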

Upvotes: 0
