Reputation: 17656
Something that I run into quite often is an error like:
>>> dd.read_csv('/tmp/*.csv', parse_dates=['start_time', 'end_time'])
Traceback (most recent call last):
...
File "/Users/brettnaul/venvs/model37/lib/python3.6/site-packages/dask/dataframe/io/csv.py", line 163, in coerce_dtypes
raise ValueError(msg)
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
The following columns failed to properly parse as dates:
- start_time
- end_time
This is usually due to an invalid value in that column. To
diagnose and fix it's recommended to drop these columns from the
`parse_dates` keyword, and manually convert them to dates later
using `dd.to_datetime`.
Clearly one of my files is misformatted, but which one? The best solution I've come up with so far is:
This seems awfully roundabout to me, but unless I'm missing something obvious, there doesn't seem to be any other identifying information available in the traceback. Is there a better way to figure out which file is failing? Using collection=False and inspecting the Delayed objects might also work, but I'm not exactly sure what to look for. Is there any way for the raised exception to include some hint as to where the problem occurred, or is that information no longer available by the time read_csv is called?
Upvotes: 2
Views: 835
Reputation: 18221
One approach is to include the file names when reading the files, defer the date parsing (following the suggestion in the error message), treat parse errors as NaTs, and pick out the problematic rows in the result. In the example below, 2.csv and 3.csv contain problematic values:
In [45]: !cat 1.csv
a
2018-01-01
2018-01-02
In [46]: !cat 2.csv
a
2018-01-03
2018-98-04
In [47]: !cat 3.csv
a
2018-01-05b
2018-01-06
In [48]: !cat 4.csv
a
2018-01-07
2018-01-08
In [49]: df = dd.read_csv('*.csv', include_path_column=True)
In [50]: df['a'] = dd.to_datetime(df.a, errors='coerce')
In [51]: df[df['a'].isnull()].path.compute()
Out[51]:
1 2.csv
0 3.csv
In particular, this tells us that the second row (indexed 1) in 2.csv and the first row (indexed 0) in 3.csv are the culprits.
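If you would rather check the files before handing them to dask at all, the same idea can be sketched with only the standard library. This is a rough sketch, not something dask or pandas provides: find_bad_dates is a made-up helper name, and it assumes every date is written as %Y-%m-%d (as in the example files above). Note that a fixed strptime format is stricter than the pandas date parser, so it may flag values that pandas would accept.

```python
import csv
import glob
from datetime import datetime


def find_bad_dates(pattern, columns, fmt="%Y-%m-%d"):
    """Return (path, row_index, column, value) for each value that fails to parse.

    Hypothetical helper: `fmt` is an assumed date format, adjust it to your data.
    Row indices are 0-based and count data rows only (the header is skipped).
    """
    bad = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            for i, row in enumerate(csv.DictReader(f)):
                for col in columns:
                    value = row.get(col)  # None if the column is missing
                    try:
                        datetime.strptime(value, fmt)
                    except (TypeError, ValueError):
                        # TypeError: missing value; ValueError: not a valid date
                        bad.append((path, i, col, value))
    return bad


# Example call for the files in this answer:
# find_bad_dates('*.csv', ['a'])
```

For the four example files, this should report row 1 of 2.csv and row 0 of 3.csv, matching the dask result above. It reads every file serially, so it is only practical when the data is small enough to scan once up front.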
Upvotes: 1