Pandas error tokenizing data when field in csv file contains quotation mark

Question

I'm using pandas.read_csv to read a tab-delimited file and am running into this error: Error tokenizing data. C error: Expected 364 fields in line 73058, saw 398

After much searching, it seems that the offending entry is: "– SO ,쳌 \ ?Œ ø ,d -L ,ú ,‚ ZO

Removing the quotation mark seems to solve the problem. I have many large files containing a lot of strange characters, so this will no doubt happen again. Do I need to strip lone (unmatched) quotation marks ahead of time, or is there some way around this?

Andy Hayden · Accepted Answer

There is a quoting argument for read_csv:

quoting : int or csv.QUOTE_* instance, default None
    Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
    QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
    Default (None) results in QUOTE_MINIMAL behavior.

These constants are described in the documentation for Python's csv module.

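Try setting quoting=3 (i.e. csv.QUOTE_NONE). Under the default QUOTE_MINIMAL behavior, a quotation mark at the start of a field opens a quoted field, so the tabs and line breaks that follow are read as part of that field until a closing quotation mark appears, which throws off the field count. With QUOTE_NONE the parser treats quotation marks as ordinary characters.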
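As a rough sketch of that suggestion, the snippet below reads a small tab-delimited sample in which one field starts with an unmatched quotation mark. The sample text and the helper name read_tsv_without_quoting are made up for illustration; only sep and quoting are real read_csv parameters.

```python
import csv
import io

import pandas as pd

# Hypothetical sample data: the second data row has a field that begins
# with a lone double quote, loosely modeled on the offending entry.
SAMPLE = 'a\tb\tc\n1\t2\t3\n4\t"SO ,d -L\t6\n7\t8\t9\n'


def read_tsv_without_quoting(source):
    """Read tab-delimited text, treating quotation marks as literal characters.

    Illustrative helper, not part of pandas.
    """
    return pd.read_csv(source, sep="\t", quoting=csv.QUOTE_NONE)


df = read_tsv_without_quoting(io.StringIO(SAMPLE))

# Every row keeps its three fields, and the quote stays in the data.
print(df.shape)
print(df.loc[1, "b"])
```

Passing csv.QUOTE_NONE instead of the bare 3 does the same thing and is easier to read. Note that with quoting turned off, any fields that really are wrapped in quotation marks will keep those marks in their values.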
