Reputation: 1542
I have more than 1 million rows, and a very long text field leaves some of them malformed, so some rows end up with more columns than my header. I worked around this with the following:
read_csv('filename.csv', error_bad_lines=False)
The problem is that there also appear to be some rows with fewer columns than my header, which makes some fields shift.
How can I fix this? Is there a way to make that long text field (which I blame for this) be read as a single field?
edit after comment
Field delimiter is comma.
When I run df.dtypes, all fields but one appear as object, even though the original data has int and datetime fields; pandas reads them as objects.
edit after comment 2
Here is the header of my .csv file: id(int),textField(string),id2(char),score(int),type(string),length(int),name(string),datetime(datetime),size(int),email(string)
The main problem is the textField column. The other fields cannot contain any characters that would break CSV syntax. However, textField is written by users and can contain anything in Unicode: emojis, non-English characters, unusual quotes, etc.
Upvotes: 0
Views: 700
Reputation: 11406
The main problem is the textField column. The other fields cannot contain any characters that would break CSV syntax. However, textField is written by users and can contain anything in Unicode: emojis, non-English characters, unusual quotes, etc.
The textField should be surrounded with double quotes, and any double quote inside that field has to be escaped by doubling it.
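If you control the export step, the quoting rule can be sketched with Python's standard csv module. This is only an illustration: the row values are made up, and your real export may come from a different tool.

```python
import csv
import io

# Hypothetical row: the second value stands in for a user-written textField
# that contains a comma, double quotes, and a line break.
row = [1, 'She said "hi", then left\nand came back', "a", 42]

buffer = io.StringIO()
# QUOTE_MINIMAL quotes only the fields that need it; embedded double quotes
# are doubled because doublequote=True is the default.
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerow(row)

encoded = buffer.getvalue()
print(encoded)
```

The text field comes out wrapped in double quotes with "hi" written as ""hi"", so a standard CSV parser can read the record back as four fields.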
Since that field can contain any character, chances are that some of those values span multiple lines, which would also explain why some rows have fewer columns while the rest of the data seems valid.
So make sure your parser supports multiline fields and is configured to use them. This will only work if those fields are properly quoted.
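As a rough check of the multiline case, here is a small sketch using Python's standard csv module on made-up data (the three-column header is a shortened, hypothetical version of yours). As far as I know, pandas.read_csv with its default quoting settings treats quoted line breaks the same way, but verify that against your own file.

```python
import csv
import io

# Hypothetical sample: the second record's textField spans two physical lines.
raw = (
    "id,textField,score\n"
    '1,"plain text",10\n'
    '2,"line one\nline two, with a comma and ""quotes""",20\n'
)

# For a real file, open it with newline="" so the csv module sees embedded
# line breaks unchanged: open(path, newline="", encoding="utf-8")
reader = csv.reader(io.StringIO(raw, newline=""))
header = next(reader)
rows = list(reader)

# Records whose field count differs from the header point at quoting problems.
bad = [r for r in rows if len(r) != len(header)]
print(len(rows), len(bad))
```

If the list of bad records is not empty on your real data, those are the rows where textField was not quoted or escaped correctly, and they need to be fixed at the source rather than skipped.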
Upvotes: 1