Pinky the mouse

Reputation: 197

Python pandas: delimiter misprint (doubled or tripled sign)

This is my code to open the file:

df = pd.read_csv(path_df, delimiter='|')

I get this error: Error tokenizing data. C error: Expected 5 fields in line 13571, saw 6

When I checked this line, I saw a misprint: there were three signs "|||" instead of one. I would prefer to treat double and triple signs as a single delimiter, but there may be another solution.

How can I solve this problem?

Upvotes: 2

Views: 77

Answers (3)

Stael

Reputation: 2689

My suspicion is that the file was written incorrectly. If a field was supposed to contain the value "|", a CSV writer would normally quote it and write the line as 1|2|3|"|"|5. If it was mistakenly written without any quoting, it would cause this issue.

In that case I don't think you can solve this with pandas, because the problem is a badly formed CSV file.

If it's a one-off, you can just edit the file first, perhaps replacing all "|||" with "||", but that could have unintended consequences. I've had this trouble before and I don't think there's a better way than manually editing the file (at least pandas gives you the line number to look at!).
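If more than one line might be affected, it can help to list every suspicious line before editing anything. Here is a minimal sketch using only the standard library; find_bad_lines and the sample text are hypothetical names made up for illustration, and it assumes the first line is a header with the correct number of fields:

```python
import csv
import io

def find_bad_lines(text, delimiter="|"):
    """Return (line_number, field_count) for rows whose field count
    differs from the header's. Blank lines are ignored."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    expected = len(next(reader))  # assumption: the header row is correct
    bad = []
    for row in reader:
        if row and len(row) != expected:
            # reader.line_num is the number of source lines read so far
            bad.append((reader.line_num, len(row)))
    return bad

# Made-up sample: line 3 holds an unquoted "|" value in the fourth field
sample = "a|b|c|d|e\n1|2|3|4|5\n1|2|3|||5\n"
print(find_bad_lines(sample))  # [(3, 6)]
```

For a real file you would pass its contents instead of sample, for example the result of open(path_df).read().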

On the other hand, if it really is just a repeated character misprint, then the other answer will work fine.
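To make the difference between the two cases concrete, here is a small sketch of what collapsing repeated delimiters does to a line. collapse_delimiters is a hypothetical helper, not part of pandas, and it works on plain strings:

```python
import re

def collapse_delimiters(line, delimiter="|"):
    """Replace every run of one or more delimiters with a single one."""
    return re.sub(re.escape(delimiter) + "+", delimiter, line)

# A pure misprint is repaired as intended
print(collapse_delimiters("ss|||s|s"))  # ss|s|s

# A legitimately empty field disappears and later values shift left
print(collapse_delimiters("1|2||4|5"))  # 1|2|4|5
```

A regex separator that matches one or more | splits lines the same way, so it is only safe when the file has no intentionally empty fields.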

Upvotes: 0

ParvBanks

Reputation: 1436

You can pass a regular expression as the separator (sep is an alias of delimiter in pd.read_csv), so that one or more consecutive | characters count as a single delimiter:

df = pd.read_csv(path_df, sep=r'\|+', engine='python')

The default C engine does not support regex separators, so specify engine='python' in the arguments. If you leave it out, pandas falls back to the python engine anyway and emits a ParserWarning.

Upvotes: 3

jezrael

Reputation: 862661

Use the regex separator [|]+, which matches one or more | characters:

import pandas as pd
from io import StringIO  # pd.compat.StringIO is not available in newer pandas versions

temp = u"""a|b|c
ss|||s|s
t|g|e"""
# after testing, replace StringIO(temp) with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="[|]+", engine='python')

print(df)
    a  b  c
0  ss  s  s
1   t  g  e

Upvotes: 6
