David

Reputation: 1494

pandas read csv with extra commas in column

I'm reading a basic csv file where the columns are separated by commas with these column names:

userid, username, body

However, the body column is a string which may contain commas. Obviously this causes a problem and pandas throws out an error:

CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 8

Is there a way to tell pandas to ignore commas in a specific column or a way to go around this problem?

Upvotes: 37

Views: 71436

Answers (5)

Karel Marik

Reputation: 1003

First, I didn't find anything to resolve the "comma inside quotes" issue systematically and properly. pandas 1.5.3 is unable to parse it correctly; I tried specifying parameters like quotechar, quoting, escapechar, lineterminator, ...

Finally, I found two workarounds that take advantage of the fact that the comma can appear in the last column only. Assume the following CSV content:

userid, username, body
1, Joe, string1
2, Jim, "string21, string22"

If you don't mind losing the part after the 3rd comma, then specify the number of columns:

import pandas as pd

pd.read_csv(r'c:\TEMP\to_parse.csv', usecols=range(3))

which yields

   userid  username        body
0       1       Joe     string1
1       2       Jim   "string21

The second workaround is more complicated, but it yields the complete string including the comma. The principle is to replace the first 2 commas with semicolons (you must know the number of columns):

import io

import pandas as pd

with open(path, 'r') as f:
    fo = io.StringIO()
    data = f.readlines()
    # Turn existing semicolons into colons, then turn the first 2 commas of
    # each line into semicolons so they become the new field separators
    fo.writelines(line.replace(';', ':').replace(',', ';', 2) for line in data)
    fo.seek(0)
df = pd.read_csv(fo, on_bad_lines='warn', sep=';')

Perhaps it could be accomplished by regex as well.
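For instance, a minimal sketch of that regex idea, under the same assumption that only the last column can contain commas (the sample data is the CSV from above, inlined):

```python
import re

import pandas as pd

raw = """userid, username, body
1, Joe, string1
2, Jim, "string21, string22"
"""

# Split each line on the first 2 commas only; anything after them
# stays together as the body column
rows = [re.split(r",\s*", line, maxsplit=2) for line in raw.strip().splitlines()]
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
```

Note that this keeps the surrounding `"` characters in the body column literally; strip them afterwards if they are unwanted.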

Upvotes: 0

Antoine Krajnc

Reputation: 1323

For me, none of the above code samples worked (I was working on the Netflix Prize dataset on Kaggle), but there is actually one cool feature from pandas 1.3.0+: the on_bad_lines parameter, which (from 1.4.0, with the python engine) accepts a callback function. Here is what I did:

import pandas as pd

def manual_separation(bad_line):
    # All the "bad lines" came from the same last column, which contained ","
    right_split = bad_line[:-2] + [",".join(bad_line[-2:])]
    return right_split

filename = "netflix_movie_titles.csv"
df = pd.read_csv(
        filename, 
        header=None,
        encoding="ISO-8859-1",
        names = ['Movie_Id', 'Year', 'Name'], 
        on_bad_lines=manual_separation,
        engine="python",
    )

Works like a charm! Your only obligation is to use engine="python". Hope that helps!
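A self-contained sketch of the same callback approach, using a small in-memory file instead of the Netflix data (the column names here are illustrative):

```python
import io

import pandas as pd

def manual_separation(bad_line):
    # Re-join everything after the second field back into one body value
    return bad_line[:-2] + [",".join(bad_line[-2:])]

raw = "1,Joe,string1\n2,Jim,string21, string22\n"
df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["userid", "username", "body"],
    on_bad_lines=manual_separation,
    engine="python",
)
print(df)
```

The second line tokenizes into 4 fields, so pandas hands it to the callback, which glues the last two fields back together.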

Upvotes: 8

hafiz031

Reputation: 2670

Does this help?

import csv

with open("csv_with_commas.csv", newline='', encoding='utf8') as f:
    csvread = csv.reader(f)
    batch_data = list(csvread)
    print(batch_data)
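If you then need a DataFrame rather than a list of rows, a minimal sketch (assuming the first row holds the headers, and using an in-memory file in place of csv_with_commas.csv):

```python
import csv
import io

import pandas as pd

# csv.reader honours standard double-quoting, so quoted commas survive
raw = 'userid,username,body\n1,Joe,string1\n2,Jim,"string21, string22"\n'
batch_data = list(csv.reader(io.StringIO(raw)))
df = pd.DataFrame(batch_data[1:], columns=batch_data[0])
print(df)
```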

Reference:

[1] https://stackoverflow.com/a/40477760/6907424

[2] To combat "UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 157: character maps to <undefined>": https://stackoverflow.com/a/9233174/6907424

Upvotes: 1

Ilyas

Reputation: 2176

Add usecols and lineterminator to your read_csv() call, where n is the number of your columns.

In my case:

import pandas as pd

n = 5  # define yours
df = pd.read_csv(file,
                 usecols=range(n),
                 lineterminator='\n',
                 header=None)

Upvotes: 21

Fabio Lamanna

Reputation: 21552

Imagine we're reading your file, called comma.csv:

userid, username, body
01, n1, 'string1, string2'

One thing you can do is specify the quote character of the strings in the column with:

import pandas as pd

df = pd.read_csv('comma.csv', quotechar="'")

In this case, strings delimited by ' are treated as a whole, regardless of the commas inside them.
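A runnable sketch of the same call, building the file in memory; skipinitialspace=True is added here because of the space after each comma in the sample data, so that the opening ' sits at the start of the field:

```python
import io

import pandas as pd

raw = "userid, username, body\n01, n1, 'string1, string2'\n"
df = pd.read_csv(io.StringIO(raw), quotechar="'", skipinitialspace=True)
print(df)
```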

Upvotes: 39
