Reputation: 1
This is not the same question as "double quoted elements in csv cant read with pandas".
The difference is that in that question, "ABC,DEF" was breaking the code.
Here, "ABC "DE" ,F" is breaking the code.
The whole string should be parsed as 'ABC "DE", F'. Instead, the inner double quotes lead to the issue described below.
I am working with a csv file that contains the following type of entries:
header1, header2, header3,header4
2001-01-01,123456,"abc def",V4
2001-01-02,789012,"ghi "jklm" n,op",V4
The second row of data is breaking the code, with the following error:
ParserError: Error tokenizing data. C error: Expected 4 fields in line 1234, saw 5
I have tried playing with the sep, delimiter, quoting, etc. arguments, but nothing seems to work.
Can someone please help with this? Thank you!
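For reference, here is a minimal sketch that should reproduce the error. It assumes the file is read with pd.read_csv and default arguments; the in-memory string is a stand-in for the real file, so the line number in the message will differ from the one above. As far as I can tell, the quote before jklm ends the quoted field early, so the comma in n,op is then treated as a delimiter and the row yields five fields.

```python
import io

import pandas as pd

# In-memory stand-in for the real file (sample rows from the question).
raw = (
    'header1, header2, header3,header4\n'
    '2001-01-01,123456,"abc def",V4\n'
    '2001-01-02,789012,"ghi "jklm" n,op",V4\n'
)

try:
    pd.read_csv(io.StringIO(raw))
    error_message = None
except pd.errors.ParserError as exc:
    # The second data row is split into five fields instead of four.
    error_message = str(exc)

print(error_message)
```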
Upvotes: 0
Views: 1037
Reputation: 1314
Based on the two rows you have provided, here is an option where the text file is read into a Series object and then a regex is applied via Series.str.extract() to get the information you want into a DataFrame:
import pandas as pd

# read the raw lines, bypassing the CSV parser
with open('so.txt') as f:
    contents = f.readlines()
s = pd.Series(contents)
s now looks like the following:
0 header1, header2, header3,header4\n
1 \n
2 2001-01-01,123456,"abc def",V4\n
3 \n
4 2001-01-02,789012,"ghi "jklm" n,op",V4
Now you can use regex extract to get what you want into a DataFrame:
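# groups: date, integer id, free text (greedy, may contain commas and quotes), two-character code
df = s.str.extract(r'^([0-9]{4}-[0-9]{2}-[0-9]{2}),([0-9]+),(.+),(\w{2})$')
# drop the lines that did not match (header and blank lines)
df = df.dropna(how='all')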
df looks like the following:
0 1 2 3
2 2001-01-01 123456 "abc def" V4
4 2001-01-02 789012 "ghi "jklm" n,op" V4
You can then set your column names with df.columns = ['header1', 'header2', 'header3', 'header4']
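Putting the steps together, here is a self-contained sketch of the same regex approach. It is a variation rather than a drop-in: the in-memory string replaces so.txt, the pattern assumes the third column is always wrapped in double quotes (so the outer quotes can be stripped, as the question asks), and the last group is loosened to \w+. The dtype conversions at the end are optional additions.

```python
import pandas as pd

# In-memory stand-in for so.txt (sample rows from the question).
raw = (
    'header1, header2, header3,header4\n'
    '2001-01-01,123456,"abc def",V4\n'
    '2001-01-02,789012,"ghi "jklm" n,op",V4\n'
)

# One Series element per line, without trailing newlines.
s = pd.Series(raw.splitlines())

# Assumption: the third field is always enclosed in double quotes.
# The greedy (.+) runs up to the last '",' on the line, so inner
# quotes and commas stay inside the third group.
pattern = r'^(\d{4}-\d{2}-\d{2}),(\d+),"(.+)",(\w+)$'

df = s.str.extract(pattern).dropna(how='all').reset_index(drop=True)
df.columns = ['header1', 'header2', 'header3', 'header4']

# Optional: convert the extracted strings to proper dtypes.
df['header1'] = pd.to_datetime(df['header1'])
df['header2'] = df['header2'].astype(int)

print(df)
```

Because the regex is anchored on the structure of the first, second, and last columns, this only works while those columns stay free of commas; any line that does not fit the pattern is silently dropped by dropna, so it is worth comparing the row count against the file.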
Upvotes: 0