Severun

Reputation: 2916

Getting CParserError: Error tokenizing data. C error: Expected 281 fields in line 1025974, saw 331

I have a 17 GB tab-separated file and I get the above error when reading it with Python/pandas

I am doing the following:

data = pd.read_csv('/tmp/testdata.tsv',sep='\t')

I have also tried adding encoding='utf8', as well as read_table and various flags (including low_memory=True), but I always get the same error at the same line.

I ran the following on the file:

awk -F"\t" 'FNR==1025974 {print NF}' /tmp/testdata.tsv

And it returns 281 for the number of fields, so awk is telling me that line has the expected 281 columns, but read_csv is telling me it saw 331.

I also ran the same awk command on lines 1025973 and 1025975, just in case the line numbering was zero-based, and they both come back as 281 fields.
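The same per-line field count can be cross-checked from Python (a minimal sketch; the path and line number are the ones from the question, and `field_count` is a hypothetical helper, not part of pandas):

```python
def field_count(path, lineno, sep="\t"):
    """Return the number of sep-delimited fields on 1-indexed line `lineno`.

    Counts raw separators, like awk's NF with -F, ignoring any quoting.
    """
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if i == lineno:
                return line.rstrip("\n").count(sep) + 1
    return None  # file has fewer than lineno lines

# Mirrors the awk check from the question:
# field_count('/tmp/testdata.tsv', 1025974)
```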

What am I missing here?

Upvotes: 2

Views: 663

Answers (1)

Severun

Reputation: 2916

So to debug this, I extracted the header line plus the offending line into a small file and ran that through read_csv. I then got a different error:

Error tokenizing data. C error: EOF inside string starting at line 1

The problem turned out to be that, by default, read_csv treats a double quote immediately after the delimiter as the start of a quoted field and keeps scanning, past tabs and even newlines, until it finds the closing quote. That is why the field counts disagreed, and why the single-line test hit EOF inside the string.

I incorrectly assumed that if I specified sep="\t" it would split only on tabs and not care about any other characters.

Long story short, to fix this, add the following flag to read_csv:

quoting=3, which is csv.QUOTE_NONE.
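A small reproduction of both the failure and the fix (a sketch using an in-memory two-line sample instead of the 17 GB file):

```python
import csv
import io

import pandas as pd

# One row where a field begins with a double quote right after the delimiter.
data = 'a\tb\tc\n1\t"2\t3\n'

# Default quoting: the " opens a quoted field, and the parser scans past
# the remaining tabs looking for a closing quote until it hits EOF.
try:
    pd.read_csv(io.StringIO(data), sep="\t")
except pd.errors.ParserError as exc:
    print("parse failed:", exc)

# quoting=3 (csv.QUOTE_NONE) makes quotes ordinary characters, so the
# parser splits strictly on tabs.
df = pd.read_csv(io.StringIO(data), sep="\t", quoting=3)
print(df.shape)       # (1, 3)
print(df.at[0, "b"])  # the literal field "2, quote included
```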

Upvotes: 1
