Reputation: 2916
I have a 17 GB tab-separated file and I get the above error when using Python/pandas.
I am doing the following:
import pandas as pd

data = pd.read_csv('/tmp/testdata.tsv', sep='\t')
I have also tried adding encoding='utf8', and tried read_table with various flags, including low_memory=True, but I always get the same error at the same line.
I ran the following on the file:
awk -F"\t" 'FNR==1025974 {print NF}' /tmp/testdata.tsv
And it returns 281 for the number of fields, so awk is telling me that line has the correct 281 columns, but read_csv is telling me I have 331.
I also ran the same awk command on lines 1025973 and 1025975, just to be sure the numbering wasn't zero-based, and they both come back as 281 fields.
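For what it's worth, the same raw-tab count can be cross-checked from Python; field_count here is a made-up helper that splits on literal tabs and ignores quoting, just like awk -F"\t" does:

```python
def field_count(path, target):
    """Count tab-separated fields on 1-based line `target`, splitting on
    raw tabs and ignoring quote characters, like awk -F"\t" {print NF}."""
    with open(path, 'rb') as f:
        for lineno, line in enumerate(f, start=1):
            if lineno == target:
                return len(line.rstrip(b'\r\n').split(b'\t'))

# e.g. field_count('/tmp/testdata.tsv', 1025974) should agree with awk
```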
What am I missing here?
Upvotes: 2
Views: 663
So, to debug this, I took my header line plus the single offending line from above and ran them through read_csv. I then got a different error:
Error tokenizing data. C error: EOF inside string starting at line 1
The problem turned out to be that, by default, read_csv looks for a closing double quote whenever it sees a double quote immediately after the delimiter.
I had incorrectly assumed that specifying sep="\t" would make it split only on tabs and ignore any other characters.
Long story short: to fix this, add the flag quoting=3 (i.e. csv.QUOTE_NONE) to the read_csv call.
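A minimal reproduction of both the failure and the fix (the two-row TSV here is made up; the unclosed quote after the tab stands in for the bad field in my real data):

```python
import csv
import io

import pandas as pd

# Hypothetical two-row TSV: the data row's second field opens a double
# quote right after the tab and never closes it.
raw = 'a\tb\tc\nx\t"y\tz\n'

# Default quoting: the parser treats the unclosed quote as the start of
# a quoted field and reads to end of file looking for the close, which
# produces the "EOF inside string" error.
try:
    pd.read_csv(io.StringIO(raw), sep='\t')
except pd.errors.ParserError as err:
    print(err)

# quoting=3 (csv.QUOTE_NONE) disables quote handling entirely, so every
# tab is a field separator and the line parses into three columns.
df = pd.read_csv(io.StringIO(raw), sep='\t', quoting=csv.QUOTE_NONE)
print(df.shape)  # (1, 3)
```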
Upvotes: 1