Reputation: 527
I have run into a problem reading a tab-separated file with pandas.
Every cell value is wrapped in double quotes, but some rows contain an extra double quote that breaks parsing. For instance:
Column A Column B Column C
"foo1" "121654" "unit"
"foo2" "1214" "unit"
"foo3" "15884""
The error I get is: Error tokenizing data. C error: Expected 31 fields in line 8355, saw 58
The code I used is:
csv = pd.read_csv(file, sep='\t', lineterminator='\n', names=None)
It works fine for the other files, but not for the ones where this extra double quote appears.
Upvotes: 1
Views: 11176
Reputation: 140168
If you cannot change the buggy input, the best way is to read the input file into an io.StringIO
object, replacing each doubled double quote with a single one, then pass this file-like object to pd.read_csv
(it accepts both filenames and file-like objects).
That way you don't have to create a temporary file or alter the input data.
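import io
import pandas as pd

# Read the whole file, collapse each doubled quote into one, and wrap the text in a file-like object
with open(file) as f:
    fileobject = io.StringIO(f.read().replace('""', '"'))
csv = pd.read_csv(fileobject, sep='\t', lineterminator='\n', names=None)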
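One caveat: the blanket replace above also turns a legitimately empty quoted field ("") into a lone quote, which could break other rows. If that matters for your data, a more targeted cleanup might look like the sketch below. It assumes the stray quote always sits at the end of a non-empty field, as in the question's example; the name fix_stray_quotes and the regular expression are my own illustration, not part of pandas.

```python
import re

# A doubled quote counts as stray only when it directly follows field content
# (not a tab, newline, or another quote) and sits at the end of a field
# (before a tab, a line break, or the end of the input).
# An empty quoted field ("") right after a tab or at line start is left alone.
_STRAY_QUOTE = re.compile(r'(?<=[^\t\n"])""(?=[\t\r\n]|$)')


def fix_stray_quotes(text: str) -> str:
    """Collapse a stray doubled quote at the end of a field into one quote."""
    return _STRAY_QUOTE.sub('"', text)


# Second row has a stray quote after 15884 and a genuinely empty last field.
sample = '"foo1"\t"121654"\t"unit"\n"foo3"\t"15884""\t""\n'
cleaned = fix_stray_quotes(sample)
```

The cleaned string can then be wrapped in io.StringIO and passed to pd.read_csv in the same way as above.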
Upvotes: 1
Reputation: 6914
You can fix the quoting issue in a preprocessing step that rewrites the file in place:
# Read the file, collapse each doubled quote into one, then overwrite the original file
with open(file, 'r') as fp:
    text = fp.read().replace('""', '"')
with open(file, 'w') as fp:
    fp.write(text)
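Since this overwrites the original, you may prefer to keep the raw file and write the cleaned text to a separate path. A minimal sketch, assuming the file is UTF-8 encoded (the question does not say); write_cleaned_copy and both path arguments are hypothetical names used only for illustration:

```python
from pathlib import Path


def write_cleaned_copy(src: str, dst: str) -> Path:
    """Write a copy of src to dst with each doubled quote collapsed to one."""
    # Encoding is an assumption; adjust it to match the actual input file.
    text = Path(src).read_text(encoding="utf-8")
    out = Path(dst)
    out.write_text(text.replace('""', '"'), encoding="utf-8")
    return out
```

The returned path can be passed to pd.read_csv with the same sep and lineterminator arguments as in the question, and the original file stays untouched.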
Upvotes: 1