Reputation: 2916
I have a 17 GB tab-separated file and I get the above error when using Python/pandas.
I am doing the following:
import pandas as pd

data = pd.read_csv('/tmp/testdata.tsv', sep='\t')
I have also tried adding encoding='utf8', and tried read_table with various flags, including low_memory=True, but I always get the same error at the same line.
I ran the following on the file:
awk -F"\t" 'FNR==1025974 {print NF}' /tmp/testdata.tsv
And it returns 281 for the number of fields, so awk is telling me that line has the correct 281 columns, but read_csv is telling me I have 331.
I also ran the same awk command on lines 1025973 and 1025975, just to be sure the numbering wasn't zero-based, and they both come back as 281 fields.
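For what it's worth, the same raw-tab count can be cross-checked from Python; field_count here is a made-up helper that splits on literal tabs and ignores quoting, just like awk -F"\t" does:

```python
def field_count(path, target):
    """Count tab-separated fields on 1-based line `target`, splitting on
    raw tabs and ignoring quote characters, like awk -F"\t" {print NF}."""
    with open(path, 'rb') as f:
        for lineno, line in enumerate(f, start=1):
            if lineno == target:
                return len(line.rstrip(b'\r\n').split(b'\t'))

# e.g. field_count('/tmp/testdata.tsv', 1025974) should agree with awk
```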
What am I missing here?
Upvotes: 2
Views: 663
So, to debug this, I took my header line plus the single offending line from above and ran them through read_csv. I then got a different error:
Error tokenizing data. C error: EOF inside string starting at line 1
The problem turned out to be that, by default, read_csv looks for a closing double quote whenever it sees a double quote immediately after the delimiter.
I had incorrectly assumed that specifying sep="\t" would make it split only on tabs and ignore any other characters.
Long story short: to fix this, add the flag quoting=3 (i.e. csv.QUOTE_NONE) to the read_csv call.
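A minimal reproduction of both the failure and the fix (the two-row TSV here is made up; the unclosed quote after the tab stands in for the bad field in my real data):

```python
import csv
import io

import pandas as pd

# Hypothetical two-row TSV: the data row's second field opens a double
# quote right after the tab and never closes it.
raw = 'a\tb\tc\nx\t"y\tz\n'

# Default quoting: the parser treats the unclosed quote as the start of
# a quoted field and reads to end of file looking for the close, which
# produces the "EOF inside string" error.
try:
    pd.read_csv(io.StringIO(raw), sep='\t')
except pd.errors.ParserError as err:
    print(err)

# quoting=3 (csv.QUOTE_NONE) disables quote handling entirely, so every
# tab is a field separator and the line parses into three columns.
df = pd.read_csv(io.StringIO(raw), sep='\t', quoting=csv.QUOTE_NONE)
print(df.shape)  # (1, 3)
```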
Upvotes: 1