Powers

Reputation: 19308

Dask ParserError: Error tokenizing data when reading CSV

I am getting the same error as this question, but the recommended solution of setting blocksize=None isn't solving the issue for me. I'm trying to convert the NYC taxi data from CSV to Parquet and this is the code I'm running:

ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2010-*.csv",
    parse_dates=["pickup_datetime", "dropoff_datetime"],
    blocksize=None,
    dtype={
        "tolls_amount": "float64",
        "store_and_fwd_flag": "object",
    },
)

ddf.to_parquet(
    "s3://coiled-datasets/nyc-tlc/2010",
    engine="pyarrow",
    compression="snappy",
    write_metadata_file=False,
)

Here's the error I'm getting:

"ParserError: Error tokenizing data. C error: Expected 18 fields in line 2958, saw 19".

Adding blocksize=None sometimes helps (see here, for example), and I'm not sure why it isn't solving my issue.

Any suggestions on how to get past this issue?

This code works for the 2011 taxi data, so there must be something weird in the 2010 taxi data that's causing this issue.

Upvotes: 1

Views: 982

Answers (1)

SultanOrazbayev

Reputation: 16551

The raw file s3://nyc-tlc/trip data/yellow_tripdata_2010-02.csv contains an error (one too many commas). This is the offending line (middle) and its neighbours:

VTS,2010-02-16 08:02:00,2010-02-16 08:14:00,5,4.2999999999999998,-73.955112999999997,40.786718,1,,-73.924710000000005,40.841335000000001,CSH,11.699999999999999,0,0.5,0,0,12.199999999999999
CMT,2010-02-24 16:25:18,2010-02-24 16:52:14,1,12.4,-73.988956000000002,40.736567000000001,1,,,-73.861762999999996,40.768383999999998,CAS,29.300000000000001,1,0.5,0,4.5700000000000003,35.369999999999997
VTS,2010-02-16 07:58:00,2010-02-16 08:09:00,1,2.9700000000000002,-73.977469999999997,40.779359999999997,1,,-74.004427000000007,40.742137999999997,CRD,9.3000000000000007,0,0.5,1.5,0,11.300000000000001

Some of the options are:

  • the on_bad_lines kwarg to pandas can be set to "warn" or "skip" (this should also work with dask.dataframe, since dd.read_csv forwards keyword arguments to pandas);

  • fix the raw file (knowing where the error is) with something like sed (assuming you can modify the raw files), or fix it on the fly by reading the file line by line.
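As a sketch of the first option, here is a minimal, self-contained example using a small in-memory CSV with one malformed row (an extra comma, mimicking the bad line in yellow_tripdata_2010-02.csv). It uses plain pandas for illustration; the same on_bad_lines="skip" kwarg can be passed to dd.read_csv, assuming pandas >= 1.3:

```python
import io
import pandas as pd

# Three-column CSV where the middle data row has an extra comma,
# producing 4 fields where 3 are expected (like the taxi file's bad line).
good = "a,b,c\n1,2,3\n"
bad = "4,5,,6\n"
tail = "7,8,9\n"
csv = io.StringIO(good + bad + tail)

# on_bad_lines="skip" silently drops the offending row instead of
# raising "ParserError: Error tokenizing data".
df = pd.read_csv(csv, on_bad_lines="skip")
print(len(df))  # 2 — only the well-formed rows survive
```

Use on_bad_lines="warn" instead if you want a warning per dropped row, which makes it easier to audit how much data is being discarded.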

Upvotes: 1
