Bedi Egilmez

Reputation: 1542

Reading csv with pandas - dealing with imbalanced rows

I have more than 1 million rows, and a very long text field is making some of my rows malformed: some rows end up with more columns than my header. I worked around this with the following:

read_csv('filename.csv', error_bad_lines=False)

The problem is that there also appear to be some rows with fewer columns than my header, which causes some fields to shift.

How can I fix this? Is there a way to make that long text field (which I blame for this) be treated as a single field?


edit after comment

The field delimiter is a comma. When I run df.dtypes, all fields but one are reported as object, even though the original data has int and datetime fields; pandas reads them as objects.


edit after comment 2

Here is the header of my .csv file:

id(int),textField(string),id2(char),score(int),type(string),length(int),name(string),datetime(datetime),size(int),email(string)

The main problem is the textField column. The other fields cannot contain any characters that would break CSV syntax. However, textField is created by users and can contain anything in Unicode: emojis, non-English characters, unusual quotes, etc.

Upvotes: 0

Views: 700

Answers (1)

Danny_ds

Reputation: 11406

The main problem is the textField column. The other fields cannot contain any characters that would break CSV syntax. However, textField is created by users and can contain anything in Unicode: emojis, non-English characters, unusual quotes, etc.

The textField should be surrounded with double quotes, and any double quote inside that field has to be escaped with another double quote.
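As an illustration of that quoting rule, here is a minimal sketch using Python's standard csv module rather than pandas. The helper name to_csv_line and the sample row are made up for this example; the point is only to show what a properly quoted textField looks like on disk when the file is produced by a CSV-aware writer.

```python
import csv
import io

def to_csv_line(row):
    """Serialize one row to a CSV line (hypothetical helper for illustration)."""
    buf = io.StringIO()
    # QUOTE_MINIMAL quotes only fields that contain the delimiter, the quote
    # character, or a line break; embedded quotes are doubled by default.
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(row)
    return buf.getvalue()

# Sample row: id, textField, id2, score (values invented for the example)
row = [1, 'He said "hi", then left', "a", 10]
line = to_csv_line(row)
print(line, end="")

# Reading the line back recovers the original text field unchanged.
parsed = next(csv.reader([line]))
print(parsed)
```

If the file is generated by code you control, writing it through a CSV library like this is usually safer than joining values with commas by hand.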

Since that field can contain any character, chances are that some of those fields span multiple lines, which would also explain why some rows have fewer columns while the rest of the data seems to be valid.

So make sure your parser supports multiline fields and is set to use them. But this will only work if those fields are properly quoted.
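To make the multiline point concrete, here is a small sketch with invented sample data, again using the standard csv module. It compares naive splitting on newlines and commas with a quote-aware parser. As far as I know, pandas.read_csv also keeps line breaks inside properly quoted fields by default, but that is worth verifying against your own file.

```python
import csv
import io

# Invented sample: record 1 has a quoted textField containing a newline
# and a comma.
raw = (
    "id,textField,score\n"
    '1,"first line\nsecond line, with a comma",10\n'
    '2,"plain text",20\n'
)

# Naive approach: split on physical lines, then on commas.
# The quoted field is torn apart, so one record becomes two short "rows".
naive = [line.split(",") for line in raw.splitlines()]
print([len(r) for r in naive])

# Quote-aware approach: csv.reader treats the quoted text as one field,
# even though it spans two physical lines.
rows = list(csv.reader(io.StringIO(raw)))
print([len(r) for r in rows])
print(repr(rows[1][1]))
```

The naive split yields a row with fewer columns than the header, which matches the symptom in the question; the quote-aware parser returns three columns for every record.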

Upvotes: 1
