Which file is causing `dask.dataframe.read_csv` to fail?

Question

I quite often run into an error like this:

>>> dd.read_csv('/tmp/*.csv', parse_dates=['start_time', 'end_time'])

Traceback (most recent call last):
...
  File "/Users/brettnaul/venvs/model37/lib/python3.6/site-packages/dask/dataframe/io/csv.py", line 163, in coerce_dtypes
    raise ValueError(msg)
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

The following columns failed to properly parse as dates:

- start_time
- end_time

This is usually due to an invalid value in that column. To
diagnose and fix it's recommended to drop these columns from the
`parse_dates` keyword, and manually convert them to dates later
using `dd.to_datetime`.

Clearly one of my files is misformatted, but which one? The best solution I've come up with so far is:

1. Re-run the same command in IPython.
2. Drop into the debugger with the `%debug` magic.
3. Print a sample of the raw CSV text to the console.
4. Find a unique-ish bit of text and grep for it until I figure out which file is the problem.

This seems awfully roundabout to me, but unless I'm missing something obvious, there doesn't seem to be any other identifying information in the traceback. Is there a better way to figure out which file is failing? Using `collection=False` and inspecting the `Delayed` objects might also work, but I'm not exactly sure what to look for. Is there any way the raised exception could include some hint as to where the problem occurred, or is that information not available once `read_csv` is being called?
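For comparison, the brute-force alternative to the grep approach would be to check each file on its own with plain pandas. The following is only a minimal sketch of that idea, not anything dask provides: the helper name `find_unparseable_files` is hypothetical, and it assumes every file matching the glob has the date columns named in the call.

```python
import glob

import pandas as pd


def find_unparseable_files(pattern, date_columns):
    """Return {path: error message} for files whose date columns fail to parse.

    Hypothetical helper for illustration; not part of dask or pandas.
    """
    failures = {}
    for path in sorted(glob.glob(pattern)):
        try:
            # Read only the date columns, as plain strings, then convert them
            # explicitly so that an invalid value raises an exception.
            frame = pd.read_csv(path, usecols=date_columns, dtype=str)
            for column in date_columns:
                pd.to_datetime(frame[column], errors="raise")
        except Exception as exc:
            # Record which file failed and why, then keep checking the rest.
            failures[path] = f"{type(exc).__name__}: {exc}"
    return failures
```

Calling `find_unparseable_files('/tmp/*.csv', ['start_time', 'end_time'])` would return a dict keyed by the offending paths. This reads every file serially, so it is slower than the parallel dask read, but it ties each failure to a filename.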
