Reputation: 1494
I'm reading a basic csv file where the columns are separated by commas with these column names:
userid, username, body
However, the body column is a string which may contain commas. Obviously this causes a problem, and pandas throws an error:
CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 8
Is there a way to tell pandas to ignore commas in a specific column or a way to go around this problem?
Upvotes: 37
Views: 71436
Reputation: 1003
First, I didn't find anything that resolves the "comma inside quotes" issue systematically and properly. pandas 1.5.3 is unable to parse it correctly; I tried specifying parameters like quotechar, quoting, escapechar, lineterminator, and so on.
Finally, I found two workaround solutions, taking advantage of the fact that I know the comma can appear only in the last column. Assume the following CSV content:
userid, username, body
1, Joe, string1
2, Jim, "string21, string22"
If you don't mind that the part after the 3rd comma is lost, then specify the number of columns:
pd.read_csv(r'c:\TEMP\to_parse.csv',usecols=range(3))
which yields
userid username body
0 1 Joe string1
1 2 Jim "string21
The second workaround is more complicated, but it yields the complete string including the comma. The principle is to replace the first 2 commas with semicolons (you must know the number of columns):
import io
import pandas as pd

with open(path, 'r') as f:
    fo = io.StringIO()
    data = f.readlines()
    # neutralize any existing semicolons, then turn the first 2 commas into semicolons
    fo.writelines(line.replace(';', ':').replace(',', ';', 2) for line in data)
fo.seek(0)
df = pd.read_csv(fo, on_bad_lines='warn', sep=';')
Perhaps it could be accomplished by regex as well.
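As a sketch of that regex idea (the sample data and the 3-column layout are assumptions, mirroring the file above):

```python
import io
import re

import pandas as pd

# Hypothetical sample: a comma may appear only in the last of 3 columns.
raw = "userid, username, body\n1, Joe, string1\n2, Jim, string21, string22\n"

# Swap only the first 2 commas on each line for semicolons, so any
# later comma stays inside the last field.
fixed = "\n".join(re.sub(",", ";", line, count=2) for line in raw.splitlines())

df = pd.read_csv(io.StringIO(fixed), sep=";", skipinitialspace=True)
print(df["body"].tolist())  # ['string1', 'string21, string22']
```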
Upvotes: 0
Reputation: 1323
For me, none of the above code samples worked (I was working on the Netflix Prize dataset on Kaggle), but there is actually one cool feature in pandas 1.4.0+: the on_bad_lines
parameter accepts a callback function. Here is what I did:
def manual_separation(bad_line):
    # all the "bad lines" came from the same last column, which contained ","
    right_split = bad_line[:-2] + [",".join(bad_line[-2:])]
    return right_split
filename = "netflix_movie_titles.csv"
df = pd.read_csv(
    filename,
    header=None,
    encoding="ISO-8859-1",
    names=['Movie_Id', 'Year', 'Name'],
    on_bad_lines=manual_separation,
    engine="python",
)
Works like a charm! Your only obligation is to use engine="python". Hope that helps!
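A minimal, self-contained sketch of the same idea (the sample rows here are made up, and the callback is generalized to rejoin everything past the expected columns back into the last one):

```python
import io

import pandas as pd

def rejoin_last_column(bad_line):
    # keep the first 2 fields, glue the rest back into the last column
    return bad_line[:2] + [",".join(bad_line[2:])]

raw = io.StringIO("1,2003,Dinosaur Planet\n2,2004,Character, A\n")
df = pd.read_csv(raw, header=None, names=["Movie_Id", "Year", "Name"],
                 on_bad_lines=rejoin_last_column, engine="python")
print(df.loc[1, "Name"])  # Character, A
```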
Upvotes: 8
Reputation: 2670
Does this help?
import csv
with open("csv_with_commas.csv", newline='', encoding='utf8') as f:
    csvread = csv.reader(f)
    batch_data = list(csvread)
print(batch_data)
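If you then want a DataFrame, the parsed rows can be handed to pandas; here is a sketch with a hypothetical in-memory file standing in for csv_with_commas.csv:

```python
import csv
import io

import pandas as pd

# hypothetical stand-in for csv_with_commas.csv
raw = io.StringIO('userid,username,body\n1,Joe,string1\n2,Jim,"string21, string22"\n')

batch_data = list(csv.reader(raw))

# csv.reader honors the double quotes, so the embedded comma survives
df = pd.DataFrame(batch_data[1:], columns=batch_data[0])
print(df.loc[1, "body"])  # string21, string22
```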
Reference:
[1] https://stackoverflow.com/a/40477760/6907424
[2] To combat "UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 157: character maps to undefined": https://stackoverflow.com/a/9233174/6907424
Upvotes: 1
Reputation: 2176
Add usecols and lineterminator to your read_csv() call, where n is the number of columns.
In my case:
n = 5  # define yours
df = pd.read_csv(file,
                 usecols=range(n),
                 lineterminator='\n',
                 header=None)
Upvotes: 21
Reputation: 21552
Imagine we're reading a CSV file called comma.csv:
userid, username, body
01, n1, 'string1, string2'
One thing you can do is to specify the quote character wrapping the strings in that column:
df = pd.read_csv('comma.csv', quotechar="'")
In this case, strings delimited by '
are treated as a whole, no matter the commas inside them.
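A quick in-memory check of this (skipinitialspace=True is an added assumption here, needed because the sample has spaces after the commas; without it the opening quote isn't seen at the start of the field):

```python
import io

import pandas as pd

# stand-in for the comma.csv sample above
raw = io.StringIO("userid, username, body\n01, n1, 'string1, string2'\n")

# skipinitialspace strips the blank after each comma so the quote
# character is recognized at the start of the field
df = pd.read_csv(raw, quotechar="'", skipinitialspace=True)
print(df.loc[0, "body"])  # string1, string2
```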
Upvotes: 39