Reputation: 19308
I am getting the same error as this question, but the recommended solution of setting blocksize=None isn't solving the issue for me. I'm trying to convert the NYC taxi data from CSV to Parquet, and this is the code I'm running:
import dask.dataframe as dd

ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2010-*.csv",
    parse_dates=["pickup_datetime", "dropoff_datetime"],
    blocksize=None,
    dtype={
        "tolls_amount": "float64",
        "store_and_fwd_flag": "object",
    },
)
ddf.to_parquet(
    "s3://coiled-datasets/nyc-tlc/2010",
    engine="pyarrow",
    compression="snappy",
    write_metadata_file=False,
)
Here's the error I'm getting:
"ParserError: Error tokenizing data. C error: Expected 18 fields in line 2958, saw 19".
Adding blocksize=None helps sometimes (see here for example), and I'm not sure why it's not solving my issue.
Any suggestions on how to get past this issue?
This code works for the 2011 taxi data, so there must be something odd in the 2010 taxi data that's causing this issue.
Upvotes: 1
Views: 982
Reputation: 16551
The raw file s3://nyc-tlc/trip data/yellow_tripdata_2010-02.csv contains an error (one too many commas). This is the offending line (middle) and its neighbours:
VTS,2010-02-16 08:02:00,2010-02-16 08:14:00,5,4.2999999999999998,-73.955112999999997,40.786718,1,,-73.924710000000005,40.841335000000001,CSH,11.699999999999999,0,0.5,0,0,12.199999999999999
CMT,2010-02-24 16:25:18,2010-02-24 16:52:14,1,12.4,-73.988956000000002,40.736567000000001,1,,,-73.861762999999996,40.768383999999998,CAS,29.300000000000001,1,0.5,0,4.5700000000000003,35.369999999999997
VTS,2010-02-16 07:58:00,2010-02-16 08:09:00,1,2.9700000000000002,-73.977469999999997,40.779359999999997,1,,-74.004427000000007,40.742137999999997,CRD,9.3000000000000007,0,0.5,1.5,0,11.300000000000001
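A quick way to confirm which of the quoted rows is malformed is to count the fields per line with the standard library (the strings below are just the three rows shown above):

```python
# The three rows quoted above; the middle one has an extra comma.
rows = [
    "VTS,2010-02-16 08:02:00,2010-02-16 08:14:00,5,4.2999999999999998,-73.955112999999997,40.786718,1,,-73.924710000000005,40.841335000000001,CSH,11.699999999999999,0,0.5,0,0,12.199999999999999",
    "CMT,2010-02-24 16:25:18,2010-02-24 16:52:14,1,12.4,-73.988956000000002,40.736567000000001,1,,,-73.861762999999996,40.768383999999998,CAS,29.300000000000001,1,0.5,0,4.5700000000000003,35.369999999999997",
    "VTS,2010-02-16 07:58:00,2010-02-16 08:09:00,1,2.9700000000000002,-73.977469999999997,40.779359999999997,1,,-74.004427000000007,40.742137999999997,CRD,9.3000000000000007,0,0.5,1.5,0,11.300000000000001",
]

# Split each row on commas and count the resulting fields.
counts = [len(r.split(",")) for r in rows]
print(counts)  # [18, 19, 18]
```

The middle row yields 19 fields where the parser expects 18, which is exactly what the ParserError reports.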
Some of the options are:
- the on_bad_lines kwarg to pandas can be set to "warn" or "skip" (so this should also be possible with dask.dataframe);
- fix the raw file (knowing where the error is) with something like sed (assuming you can modify the raw files), or on the fly by reading the file line by line.
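The "reading the file line by line" option can be sketched with just the standard library; the sample data, expected field count, and variable names below are illustrative, not taken from the actual taxi files:

```python
import csv
import io

# Hypothetical sample: three comma-separated rows, the middle
# one has one field too many (mimicking the extra comma above).
raw = io.StringIO(
    "a,b,c\n"
    "1,2,3,4\n"
    "x,y,z\n"
)

expected = 3  # number of fields a well-formed row should have

# Keep only rows with the expected field count; malformed rows are dropped.
good_rows = [row for row in csv.reader(raw) if len(row) == expected]
print(good_rows)  # [['a', 'b', 'c'], ['x', 'y', 'z']]
```

With dask itself, forwarding on_bad_lines="skip" (accepted by pandas.read_csv since pandas 1.3) through dd.read_csv should achieve the same effect without pre-filtering the files.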
Upvotes: 1