Reputation: 18893
Data:
from io import StringIO
import pandas as pd
s = '''ID,Level,QID,Text,ResponseID,responseText,date_key
375280046,S,D3M,Which is your favorite?,D5M0,option 1,2012-08-08 00:00:00
375280046,S,D3M,How often? (at home, at work, other),D3M0,Work,2010-03-31 00:00:00
375280046,M,A78,Do you prefer a, b, or c?,A78C,a,2010-03-31 00:00:00'''
df = pd.read_csv(StringIO(s))
Error received:
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 7 fields in line 3, saw 9
It's very obvious why I'm receiving this error. The data contains text such as How often? (at home, at work, other)
and Do you prefer a, b, or c?
.
How does one read this type of data into a pandas DataFrame?
Upvotes: 2
Views: 1443
Reputation: 18893
Of course, as I write the question, I figured it out. Rather than delete it, I'll share it with my future self when I forget how to do this.
Apparently, pandas default sep=','
can also be a regular expression.
The solution was to add sep=r',(?!\s)'
to read_csv
like so:
df = pd.read_csv(StringIO(s), sep=r',(?!\s)')
The (?!\s)
part is a negative lookahead to match only commas that don't have a following space after them.
Result:
ID Level QID Text ResponseID \
0 375280046 S D3M Which is your favorite? D5M0
1 375280046 S D3M How often? (at home, at work, other) D3M0
2 375280046 M A78 Do you prefer a, b, or c? A78C
responseText date_key
0 option 1 2012-08-08 00:00:00
1 Work 2010-03-31 00:00:00
2 a 2010-03-31 00:00:00
Upvotes: 2