Reading a tab separated file with Python Pandas

Question

I have encountered a problem reading a tab separated file using Pandas.

All the cell values are wrapped in double quotes, but in some rows there is an extra double quote that breaks parsing. For instance:

Column A  Column B  Column C
"foo1"    "121654"  "unit"
"foo2"    "1214"    "unit"
"foo3"    "15884""

The error I get is: Error tokenizing data. C error: Expected 31 fields in line 8355, saw 58

The code I used is:

csv = pd.read_csv(file, sep='\t', lineterminator='\n', names=None)

and it works fine for the rest of the files but not for the ones where this extra double quotation appears.

Jean-François Fabre · Accepted Answer

If you cannot change the buggy input, the best way is to read the input file into an io.StringIO object, replacing each doubled double quote with a single one, then pass this file-like object to pd.read_csv (it accepts both filenames and file-like objects).

That way you don't have to create a temporary file or to alter the input data.

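import io
import pandas as pd

# Read the whole file, collapse doubled quotes, and wrap the text in a file-like object
with open(file) as f:
    fileobject = io.StringIO(f.read().replace('""', '"'))

csv = pd.read_csv(fileobject, sep='\t', lineterminator='\n', names=None)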
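
To try the approach without touching a real file, here is a minimal self-contained sketch. The sample text is hypothetical: it is modelled on the rows in the question, with a third column added to the last row, and it stands in for the contents of the real file.

```python
import io

import pandas as pd

# Hypothetical sample data: the last row carries a stray extra double quote.
raw = (
    'Column A\tColumn B\tColumn C\n'
    '"foo1"\t"121654"\t"unit"\n'
    '"foo2"\t"1214"\t"unit"\n'
    '"foo3"\t"15884""\t"unit"\n'
)

# Collapse each doubled quote into a single one before parsing.
cleaned = io.StringIO(raw.replace('""', '"'))

df = pd.read_csv(cleaned, sep='\t')
print(df)
```

Note that the replacement is blunt: it would also turn a legitimately empty quoted field ("") into a lone quote, so it is worth checking whether the file contains any before relying on it.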
