Romain Jouin

Reputation: 4838

Python: dask vs. pandas, error in read_csv

I get an error when reading a file with dask, although the same file reads fine with pandas:

import dask.dataframe as dd
import pandas as pd
pdf = pd.read_csv("./tous_les_docs.csv")
pdf.shape
(20140796, 7)

whereas dask raises an error:

df = dd.read_csv("./tous_les_docs.csv")
df.describe().compute()
ParserError: Error tokenizing data. C error: EOF inside string starting at line 192999

Answer: adding "blocksize=None" makes it work:

df = dd.read_csv("./tous_les_docs.csv", blocksize=None)

Upvotes: 0

Views: 982

Answers (1)

Lee

Reputation: 1427

The documentation says that this can happen:

It should also be noted that this function may fail if a CSV file includes quoted strings that contain the line terminator. To get around this you can specify blocksize=None to not split files into multiple partitions, at the cost of reduced parallelism.

http://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv
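
To check whether a file actually contains such quoted strings, one option is to count how many physical lines each CSV record spans. The following is a minimal sketch using only Python's standard csv module; the function name multiline_records is made up for illustration, and it assumes the default dialect (comma delimiter, double-quote quoting).

```python
import csv


def multiline_records(lines):
    """Yield (first_line, last_line) for every CSV record that spans more
    than one physical line, i.e. has a line terminator inside a quoted field.

    `lines` is any iterable of text lines, for example an open file.
    """
    reader = csv.reader(lines)
    prev_line = 0
    for _ in reader:
        # reader.line_num counts physical lines consumed so far, which can
        # exceed the number of records when a record spans several lines.
        if reader.line_num - prev_line > 1:
            yield prev_line + 1, reader.line_num
        prev_line = reader.line_num
```

With a real file, pass a file object opened with newline="" so that the csv module sees the original line terminators. Any range it reports marks a record that a line-based split could cut in half.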

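It seems Dask splits the file into chunks at line terminators without scanning the whole file from the start, so it cannot tell whether a given line terminator sits inside a quoted string. A chunk can then begin or end in the middle of a quoted field, which would explain the "EOF inside string" error.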
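
The effect can be reproduced on a tiny example. The sketch below is a simplified illustration of the idea, not Dask's actual code: naive_blocks is a hypothetical helper that cuts text at the first newline at or after each block boundary without looking at quotes, and the sample data is invented.

```python
import csv
import io

# Invented sample: the record with id 2 has a newline inside a quoted field.
CSV_TEXT = 'id,text\n1,"hello"\n2,"line one\nline two"\n3,"bye"\n'


def parse(text):
    """Parse CSV text into a list of rows using the standard csv module."""
    return list(csv.reader(io.StringIO(text)))


def naive_blocks(text, blocksize):
    """Cut text into blocks of about `blocksize` characters, each ending at
    the next newline, without checking whether that newline is quoted."""
    blocks, start = [], 0
    while start < len(text):
        end = text.find("\n", start + blocksize - 1)
        end = len(text) if end == -1 else end + 1
        blocks.append(text[start:end])
        start = end
    return blocks


def has_open_quote(block):
    """True if the block holds an odd number of double quotes, meaning it
    starts or ends inside a quoted field."""
    return block.count('"') % 2 == 1


# Read as a whole (the blocksize=None case), the quoted newline is handled.
print(len(parse(CSV_TEXT)), "rows when parsed in one piece")

# Cut into blocks, the quoted newline is mistaken for a record boundary.
for block in naive_blocks(CSV_TEXT, blocksize=20):
    print(repr(block), "open quote:", has_open_quote(block))
```

Here the first block ends right after "line one", inside the quoted field, so a parser that only sees that block reaches the end of its input while still inside a string. Reading the file as a single partition avoids the cut, at the cost of parallelism.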

Upvotes: 1
