Jarad
Jarad

Reputation: 18893

Read CSV with extra commas and no quotechar with Pandas?

Data:

from io import StringIO
import pandas as pd

s = '''ID,Level,QID,Text,ResponseID,responseText,date_key
375280046,S,D3M,Which is your favorite?,D5M0,option 1,2012-08-08 00:00:00
375280046,S,D3M,How often? (at home, at work, other),D3M0,Work,2010-03-31 00:00:00
375280046,M,A78,Do you prefer a, b, or c?,A78C,a,2010-03-31 00:00:00'''

df = pd.read_csv(StringIO(s))

Error received:

pandas.io.common.CParserError: Error tokenizing data. C error: Expected 7 fields in line 3, saw 9

It's very obvious why I'm receiving this error. The data contains text such as How often? (at home, at work, other) and Do you prefer a, b, or c?.

How does one read this type of data into a pandas DataFrame?

Upvotes: 2

Views: 1443

Answers (1)

Jarad
Jarad

Reputation: 18893

Of course, as I write the question, I figured it out. Rather than delete it, I'll share it with my future self when I forget how to do this.

Apparently, pandas default sep=',' can also be a regular expression.

The solution was to add sep=r',(?!\s)' to read_csv like so:

df = pd.read_csv(StringIO(s), sep=r',(?!\s)')

The (?!\s) part is a negative lookahead to match only commas that don't have a following space after them.

Result:

          ID Level  QID                                  Text ResponseID  \
0  375280046     S  D3M               Which is your favorite?       D5M0   
1  375280046     S  D3M  How often? (at home, at work, other)       D3M0   
2  375280046     M  A78             Do you prefer a, b, or c?       A78C   

  responseText             date_key  
0     option 1  2012-08-08 00:00:00  
1         Work  2010-03-31 00:00:00  
2            a  2010-03-31 00:00:00  

Upvotes: 2

Related Questions