zwlayer

Reputation: 1824

How to ignore commas inside sentences while reading a csv file with Python pandas

I have a csv file and I'd like to read it with pandas library in Python.

Here is the header and first line of my file.

content,topic,class,NRC-Affect-Intensity-anger_Score,NRC-Affect-Intensity-fear_Score,NRC-Affect-Intensity-sadness_Score,NRC-Affect-Intensity-joy_Score
'@stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in its own right.',kindle2,positive,0,0,0,0

It is comma separated and has 7 fields. When I tried to read this file I got an error:

zz = pd.read_csv('proc_data.csv', sep=',')
Error tokenizing data. C error: Expected 8 fields in line 14, saw 9

I guess it is complaining about the commas within the first column (the part between the ' characters).

Is it possible to read this file correctly?

head -15 less proc_data.csv
head: less: No such file or directory
==> proc_data.csv <==
content,topic,class,NRC-Affect-Intensity-anger_Score,NRC-Affect-Intensity-fear_Score,NRC-Affect-Intensity-sadness_Score,NRC-Affect-Intensity-joy_Score
'@stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in its own right.',kindle2,positive,0,0,0,0
'Reading my kindle2...  Love it... Lee childs is good read.',kindle2,positive,0,0,0,1.375
'Ok, first assesment of the #kindle2 ...it fucking rocks!!!',kindle2,positive,0,0,0,0
'@kenburbary You\'ll love your Kindle2. I\'ve had mine for a few months and never looked back. The new big one is huge! No need for remorse! :)',kindle2,positive,0,0,0.594,1.125
'@mikefish  Fair enough. But i have the Kindle2 and I think it\'s perfect  :)',kindle2,positive,0,0,0,0.719
'@richardebaker no. it is too big. I\'m quite happy with the Kindle2.',kindle2,positive,0,0,0,0.788
'Fuck this economy. I hate aig and their non loan given asses.',aig,negative,0.828,0.484,0.656,0
'Jquery is my new best friend.',jquery,positive,0,0,0,0.471
'Loves twitter',twitter,positive,0,0,0,0
'how can you not love Obama? he makes jokes about himself.',obama,positive,0,0,0,0.828
'Check this video out -- President Obama at the White House Correspondents\' Dinner ',obama,neutral,0,0,0,0.109
'@Karoli I firmly believe that Obama/Pelosi have ZERO desire to be civil.  It\'s a charade and a slogan, but they want to destroy conservatism',obama,negative,0,0,0,0.484
'House Correspondents dinner was last night whoopi, barbara &amp; sherri went, Obama got a standing ovation',obama,positive,0,0,0.078,0
'Watchin Espn..Jus seen this new Nike Commerical with a Puppet Lebron..sh*t was hilarious...LMAO!!!',nike,positive,0,0,0,0.672

Upvotes: 0

Views: 1121

Answers (1)

Uvar

Reputation: 3462

You are trying to separate the columns by commas, yet within strings commas can be present.

This would normally be taken care of by the quotechar argument of the read_csv method, which defaults to quotechar='"'. However, your csv file uses single quotes, so you need to change it to quotechar="'".

This, however, runs into the issue that inside of the strings, apostrophes are present, which are preceded by escaping backslashes. By default, pd.read_csv has its escapechar argument set to None, so you'll have to set this one as well.

All in all, we end up with:

pd.read_csv('proc_data.csv', sep=',',quotechar="'", escapechar='\\')

Note that the backslash itself needs to be escaped in the Python string literal here, hence '\\'.
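As a minimal sketch of the call above, assuming pandas is installed, here it is applied to two rows copied from the question and held in an in-memory string (the real code would pass 'proc_data.csv' instead of the io.StringIO object):

```python
import io

import pandas as pd

# Header and two rows copied from the question. The raw string keeps the
# backslashes exactly as they appear in the file.
sample = r"""content,topic,class,NRC-Affect-Intensity-anger_Score,NRC-Affect-Intensity-fear_Score,NRC-Affect-Intensity-sadness_Score,NRC-Affect-Intensity-joy_Score
'@stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in its own right.',kindle2,positive,0,0,0,0
'@richardebaker no. it is too big. I\'m quite happy with the Kindle2.',kindle2,positive,0,0,0,0.788
"""

# quotechar="'" treats single-quoted text as one field, so commas inside it
# are not separators; escapechar="\\" lets \' stand for a literal apostrophe.
df = pd.read_csv(io.StringIO(sample), sep=",", quotechar="'", escapechar="\\")

print(df.shape)
print(df.loc[1, "content"])
```

Both rows should come out with 7 columns, with the comma in the first tweet kept inside the content field and the escaped apostrophe in the second one restored.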

If you're not that concerned about individual lines and just want to read in as much as you can successfully parse, you can add the keyword error_bad_lines=False. (In pandas 1.3 and later this is spelled on_bad_lines='warn'; error_bad_lines was removed in pandas 2.0.) Then figure out from the warnings whether those lines can be fixed or need to be given up on.
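A minimal sketch of skipping unparseable lines, assuming pandas 1.3 or later, where the option is on_bad_lines. The three-column header is shortened from the question's file and the malformed middle row is invented purely for illustration:

```python
import io

import pandas as pd

# Shortened, hypothetical sample: two good rows from the question and one
# invented row with too many fields because its text is not quoted.
sample = (
    "content,topic,class\n"
    "'Loves twitter',twitter,positive\n"
    "broken, unquoted, text,with,extra,commas\n"
    "'Jquery is my new best friend.',jquery,positive\n"
)

# on_bad_lines="skip" silently drops rows with too many fields;
# on_bad_lines="warn" drops them too but emits a warning for each one.
df = pd.read_csv(
    io.StringIO(sample),
    quotechar="'",
    escapechar="\\",
    on_bad_lines="skip",
)

print(df)
```

Using "warn" rather than "skip" is probably the better first step, since the warnings tell you which line numbers were dropped.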

Upvotes: 1
