Vasilis Vasileiou

Reputation: 527

Reading a tab separated file with Python Pandas

I have encountered a problem reading a tab separated file using Pandas.

All the cell values are wrapped in double quotes, but in some rows a cell has an extra double quote that breaks parsing. For instance:

Column A  Column B  Column C
"foo1"    "121654"  "unit"
"foo2"    "1214"    "unit"
"foo3"    "15884""  

The error I get is: Error tokenizing data. C error: Expected 31 fields in line 8355, saw 58

The code I used is:

csv = pd.read_csv(file, sep='\t',  lineterminator='\n', names=None) 

It works fine for the rest of the files, but not for the ones where this extra double quote appears.

Upvotes: 1

Views: 11176

Answers (2)

Jean-François Fabre

Reputation: 140168

If you cannot change the buggy input, the best way is to read the input file into an io.StringIO object, replacing the doubled double quotes with single ones, then pass this file-like object to pd.read_csv (it accepts both filenames and file-like objects).

That way you don't have to create a temporary file or alter the input data.

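import io

import pandas as pd

# Read the whole file and collapse each doubled quote ("") into a single quote.
with open(file) as f:
    fileobject = io.StringIO(f.read().replace('""', '"'))

# read_csv accepts the in-memory file-like object in place of a filename.
csv = pd.read_csv(fileobject, sep='\t', lineterminator='\n', names=None)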
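To try the idea without the original file, here is a minimal sketch on an in-memory sample modeled on the rows in the question. The sample text, the helper name read_quoted_tsv and the STRAY_QUOTE pattern are assumptions made for illustration, not part of the original data. It swaps the plain replace for a narrower regular expression, because a blanket replace('""', '"') would also turn a legitimately empty field ("") into a lone quote.

```python
import io
import re

import pandas as pd

# Hypothetical sample: row 2 has a legitimately empty quoted field,
# row 3 has the stray extra quote described in the question.
SAMPLE = (
    '"Column A"\t"Column B"\t"Column C"\n'
    '"foo1"\t"121654"\t"unit"\n'
    '"foo2"\t""\t"unit"\n'
    '"foo3"\t"15884""\t"unit"\n'
)

# Match a doubled quote that closes a non-empty field: it must be preceded by
# a character other than tab, newline or quote, and followed by a tab,
# a line break or the end of the text.
STRAY_QUOTE = re.compile(r'(?<=[^\t\n"])""(?=[\t\r\n]|$)')


def read_quoted_tsv(text):
    """Collapse stray closing quotes, then parse the text as tab-separated data."""
    cleaned = STRAY_QUOTE.sub('"', text)
    # dtype=str and keep_default_na=False keep every cell as the literal string.
    return pd.read_csv(io.StringIO(cleaned), sep='\t', dtype=str,
                       keep_default_na=False)


df = read_quoted_tsv(SAMPLE)
print(df)
```

If the real files never contain empty quoted fields, the simpler str.replace is probably enough; the regex only matters when "" can also be a valid value.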

Upvotes: 1

taras

Reputation: 6914

You can add a preprocessing step that fixes the quotation issue before reading the file with Pandas. Note that this overwrites the original file in place:

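# Read the whole file and collapse each doubled quote ("") into a single quote.
with open(file, 'r') as fp:
    text = fp.read().replace('""', '"')

# Write the cleaned text back to the same path, replacing the original contents.
with open(file, 'w') as fp:
    fp.write(text)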
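If overwriting the input is a concern, the same step can write to a separate file instead. The following is a minimal sketch under that assumption; the function name write_cleaned_copy and the file names are hypothetical, and the demo runs on a throwaway file in a temporary directory.

```python
import tempfile
from pathlib import Path


def write_cleaned_copy(src, dst):
    """Write a copy of src to dst with each doubled quote collapsed; src is not modified."""
    text = Path(src).read_text()
    Path(dst).write_text(text.replace('""', '"'))
    return dst


# Demo on a throwaway file containing one broken row from the question.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'input.tsv'
    dst = Path(tmp) / 'input_clean.tsv'
    src.write_text('"foo3"\t"15884""\t"unit"\n')

    write_cleaned_copy(src, dst)

    original_text = src.read_text()  # still contains the stray quote
    cleaned_text = dst.read_text()   # stray quote removed
```

The cleaned path can then be passed to pd.read_csv with the same arguments as in the question.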

Upvotes: 1
