user10771507

Reputation: 1

Pandas read_csv - How to handle a comma inside double quotes that are themselves inside double quotes

This is not the same question as double quoted elements in csv cant read with pandas.

The difference is that in that question: "ABC,DEF" was breaking the code.

Here, "ABC "DE" ,F" is breaking the code.

The whole string should be parsed in as 'ABC "DE", F'. Instead, the inner double quotes cause the error shown below.

I am working with a csv file that contains the following type of entries:

header1, header2, header3,header4

2001-01-01,123456,"abc def",V4

2001-01-02,789012,"ghi "jklm" n,op",V4

The second row of data is breaking the code, with the following error:

ParserError: Error tokenizing data. C error: Expected 4 fields in line 1234, saw 5

I have tried various combinations of the sep, delimiter, and quoting arguments, but nothing seems to work.

Can someone please help with this? Thank you!

Upvotes: 0

Views: 1037

Answers (1)

jeschwar

Reputation: 1314

Based on the two rows you have provided, here is one option: read the text file into a Series, then use Series.str.extract() with a regex to pull the fields you want into a DataFrame:

import pandas as pd

# read the raw lines without any CSV parsing
with open('so.txt') as f:
    contents = f.readlines()

s = pd.Series(contents)

s now looks like the following:

0    header1, header2, header3,header4\n
1    \n
2    2001-01-01,123456,"abc def",V4\n
3    \n
4    2001-01-02,789012,"ghi "jklm" n,op",V4

Now you can use regex extract to get what you want into a DataFrame:

# raw string so that \w is passed to the regex engine unchanged
df = s.str.extract(r'^([0-9]{4}-[0-9]{2}-[0-9]{2}),([0-9]+),(.+),(\w{2})$')

# remove empty rows
df = df.dropna(how='all')

df looks like the following:

            0       1                   2   3
2  2001-01-01  123456           "abc def"  V4
4  2001-01-02  789012  "ghi "jklm" n,op"  V4

You can then set your column names with df.columns = ['header1', 'header2', 'header3', 'header4']
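Note that the third column above still carries its outer double quotes. If you want them removed, a variant of the same regex idea can be sketched with the standard re module. This is a rough sketch, not a general CSV fix: parse_lines and ROW are names made up here, and the pattern assumes every data row has the shape date,integer,"quoted text",two-character code, as in the two sample rows.

```python
import re

# Pattern adapted from the regex above; the quotes around the third
# field sit outside the capture group, so they are dropped.
ROW = re.compile(r'^(\d{4}-\d{2}-\d{2}),(\d+),"(.+)",(\w{2})$')

def parse_lines(lines):
    """Return (header, rows); lines that do not match ROW are skipped."""
    lines = [line.strip() for line in lines if line.strip()]
    header = [name.strip() for name in lines[0].split(',')]
    rows = [m.groups() for m in map(ROW.match, lines[1:]) if m]
    return header, rows

# Same content as the sample file, including the blank lines
sample = [
    'header1, header2, header3,header4\n',
    '\n',
    '2001-01-01,123456,"abc def",V4\n',
    '\n',
    '2001-01-02,789012,"ghi "jklm" n,op",V4\n',
]
header, rows = parse_lines(sample)
print(header)
print(rows)
```

The result can be passed to pd.DataFrame(rows, columns=header). All values are strings at that point, and rows that do not match the pattern are dropped silently, so it is worth comparing the row count against the file.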

Upvotes: 0
