Reputation: 31
I've downloaded a large CSV file that uses "," as the separator (without the quotes). My current code reads some lines correctly, but others are not split on the "," and the whole line ends up in the first column instead ... The problem seems to be that in some rows the text column contains additional "," characters, which is why there is a " quote before Dr and after .... in line 3.
Is there a way to parse the file so that I get the desired output while keeping the "," characters that sit between the two " " in the text column?
Example CSV file (TwitterData_2017.csv):
Username, date, retweets, favorites, text
,2017-01-02,0,0,History makes this very clear ....
,2017-01-02,0,0,S&P reaches new heights ....
,2017-01-02,0,0,"Dr Pepper ,Snapple Group Projection ...."
,2017-01-02,0,0,S&P is going strong ....
Code:
import pandas as pd
import numpy as np
rawData = pd.read_csv('TwitterData_2017.csv', sep=",", quotechar='"')
print(rawData.head(n=4))
Output:
Username Date Retweets favorites text
NaN 2017-01-02 0 0 History makes this very clear ....
NaN 2017-01-02 0 0 S&P reaches new heights ....
,2017-01-02,0,0,"Dr Pepper ,Snapple Group Projection ...."
NaN 2017-01-02 0 0 S&P is going strong ....
As you can see, the code works for lines 1, 2 and 4 but fails on line 3. This seems to happen because the text field in that row is wrapped in " at the beginning and end, since it contains an extra ",".
I'm using Python 3 and running everything through IntelliJ.
I would appreciate advice on how I can rectify this and get every row into the same format.
PS: I have other rows that contain multiple "," between the two " " in the text column, and if possible I would like the parser to ignore those (not split on them).
Upvotes: 0
Views: 565
Reputation: 704
You should pass the quotechar:
import pandas as pd
# quotechar='"' tells the parser that commas inside double quotes belong to the field
rawData = pd.read_csv('TwitterData_2017.csv', sep=",", quotechar='"')
print(rawData.head(n=4))
The " before Dr, for example, is there because CSV uses a quote character to enclose strings that contain the separator character; by default that character is ". So you need to pass quotechar when reading, so that the CSV parser knows where a string containing the separator starts and ends.
Upvotes: 1
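To illustrate the quoting rule described in the answer, here is a minimal sketch using Python's standard csv module rather than pandas (pandas' read_csv accepts a quotechar parameter with the same meaning). The sample rows are modelled on the question with shortened placeholder text, and the helper name parse_rows is hypothetical. It assumes the quoted field in the real file looks exactly like the one in the example.

```python
import csv
import io

# Sample rows modelled on the question; the text values are shortened placeholders.
sample = (
    "Username,date,retweets,favorites,text\n"
    ",2017-01-02,0,0,History makes this very clear\n"
    ',2017-01-02,0,0,"Dr Pepper ,Snapple Group Projection"\n'
)


def parse_rows(text, quotechar='"'):
    """Hypothetical helper: split CSV text into rows, keeping quoted fields whole."""
    return list(csv.reader(io.StringIO(text), delimiter=",", quotechar=quotechar))


# With quoting enabled, the comma inside the quotes stays part of the text field.
rows = parse_rows(sample)
for row in rows:
    print(len(row), row)

# For contrast: with quoting disabled, the same line is split on every comma.
unquoted = list(csv.reader(io.StringIO(sample), quoting=csv.QUOTE_NONE))
print(len(unquoted[2]), unquoted[2])
```

If the quoted rows parse into five fields here but not in the real file, the raw line in the file probably differs from the example (for instance, the whole line might be wrapped in quotes), so it may be worth inspecting that line in a text editor.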