bnaul

Reputation: 17656

Which file is causing `dask.dataframe.read_csv` to fail?

Something that I run into quite often is an error like

>>> dd.read_csv('/tmp/*.csv', parse_dates=['start_time', 'end_time'])

Traceback (most recent call last):
...
  File "/Users/brettnaul/venvs/model37/lib/python3.6/site-packages/dask/dataframe/io/csv.py", line 163, in coerce_dtypes
    raise ValueError(msg)
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

The following columns failed to properly parse as dates:

- start_time
- end_time

This is usually due to an invalid value in that column. To
diagnose and fix it's recommended to drop these columns from the
`parse_dates` keyword, and manually convert them to dates later
using `dd.to_datetime`.

Clearly one of my files is misformatted, but which one? The best solution I've come up with so far is:

This seems awfully roundabout to me, but unless I'm missing something obvious, there doesn't seem to be any other identifying information in the traceback. Is there a better way to figure out which file is failing? Using `collection=False` and inspecting the `Delayed` objects might also work, but I'm not exactly sure what to look for. Is there any way for the raised exception to include some hint as to where the problem occurred, or is that information no longer available by the time `read_csv` is called?

Upvotes: 2

Views: 835

Answers (1)

fuglede

Reputation: 18221

One approach is to include the file names when reading the files, defer the date parsing (as the error message suggests), coerce unparseable values to `NaT`, and then pick out the problematic rows in the result. In the example below, `2.csv` and `3.csv` contain problematic values:

In [45]: !cat 1.csv
a
2018-01-01
2018-01-02

In [46]: !cat 2.csv
a
2018-01-03
2018-98-04

In [47]: !cat 3.csv
a
2018-01-05b
2018-01-06

In [48]: !cat 4.csv
a
2018-01-07
2018-01-08

In [49]: df = dd.read_csv('*.csv', include_path_column=True)

In [50]: df['a'] = dd.to_datetime(df.a, errors='coerce')

In [51]: df[df['a'].isnull()].path.compute()
Out[51]: 
1    2.csv
0    3.csv

In particular, this tells us that the second row (indexed 1) in 2.csv and the first row (indexed 0) in 3.csv are the culprits.
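
If the files are small enough to scan directly, the same per-file, per-row check can also be sketched without dask or pandas. The following is only an illustration using the standard library: `find_bad_dates` is a hypothetical helper name, and the `%Y-%m-%d` format is an assumption based on the example files above, so it would need adjusting for other date layouts.

```python
import csv
import glob
from datetime import datetime


def find_bad_dates(pattern, column, fmt="%Y-%m-%d"):
    """Return (path, row_index, value) for each value that fails to parse.

    Hypothetical helper, not part of dask or pandas. row_index counts
    data rows from 0 and excludes the header line.
    """
    bad = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as handle:
            for row_index, row in enumerate(csv.DictReader(handle)):
                value = row[column]
                try:
                    # Assumed date layout; strptime raises ValueError on
                    # invalid dates or trailing characters.
                    datetime.strptime(value, fmt)
                except (TypeError, ValueError):
                    bad.append((path, row_index, value))
    return bad


if __name__ == "__main__":
    for path, row_index, value in find_bad_dates("*.csv", "a"):
        print(f"{path}: row {row_index}: {value!r}")
```

Unlike the dask version, this reads every file sequentially in a single process, so it is better suited to a one-off diagnosis than to large datasets.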

Upvotes: 1
