Reputation: 851
I have a large CSV file, and a few entries in one of the columns contain a comma within the entry. For example, one entry might read something like "\"a,bcde\"", another might read "\"abcde\"", and another will read like <unknown>. So, when the entry is known (sometimes it's also just an empty string, e.g. ""), the pattern is clear: it starts with "\" and ends with \"", and everything between these markers is the entry I want.
When I try to pull this into pandas, I get an error on the line containing the comma entry (Error tokenizing data. C error: Expected 32 fields in line 24193, saw 33). I'm not sure how to handle this; any suggestions are welcome!
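For what it's worth: if the odd quoting really is backslash-escaped quotes inside normally quoted fields, pandas may be able to parse it without any preprocessing via read_csv's escapechar parameter. A minimal sketch on made-up data mimicking the described format (the real file may differ):

```python
from io import StringIO
import pandas as pd

# Made-up two-row sample mimicking the described format: "\"a,bcde\"" etc.
sample = 'col1,col2\n"\\"a,bcde\\"",x\n"\\"abcde\\"",y\n'

# escapechar='\\' tells the parser that \" is a literal quote inside a
# quoted field, so the embedded comma no longer splits the row.
df = pd.read_csv(StringIO(sample), escapechar='\\')

# The literal quote characters remain in the value; strip them if unwanted.
df['col1'] = df['col1'].str.strip('"')
print(df['col1'].tolist())
```

This avoids rewriting the file by hand, but it assumes the backslashes appear literally in the raw file exactly as shown in the question.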
Upvotes: 1
Views: 73
Reputation: 851
Here is what finally worked for me, given that some of the encoding was not utf-8.
import re
from io import StringIO
import pandas as pd

def multiple_replace(d, text):
    # Create a regular expression that matches any of the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, d.keys())))
    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda mo: d[mo.string[mo.start():mo.end()]], text)

d = {
    '"\\"' : '"',
    '\\""' : '"',
}

# errors='ignore' skips bytes that are not valid UTF-8
with open("file.csv", 'r', errors='ignore') as f:
    lines = [multiple_replace(d, l) for l in f]

df = pd.read_csv(StringIO(''.join(lines)))
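As a quick self-contained check of the approach above, here is the same replace-then-parse technique run on a made-up in-memory sample instead of file.csv, showing that the embedded comma survives as part of the field:

```python
import re
from io import StringIO
import pandas as pd

def multiple_replace(d, text):
    # Match any dictionary key and substitute its value
    regex = re.compile("(%s)" % "|".join(map(re.escape, d.keys())))
    return regex.sub(lambda mo: d[mo.string[mo.start():mo.end()]], text)

# Collapse the "\" ... \"" markers down to ordinary CSV quoting
d = {'"\\"': '"', '\\""': '"'}

# Made-up sample mimicking the described format
sample = ['col1,col2\n', '"\\"a,bcde\\"",x\n']
lines = [multiple_replace(d, l) for l in sample]

df = pd.read_csv(StringIO(''.join(lines)))
print(df['col1'].tolist())  # the comma stays inside the single field
```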
Upvotes: 0
Reputation: 2706
This is a mash-up of two methods: the first is a multiple replace, and the second is reading the CSV into memory, doing the replaces you need, and then handing it to pandas. It will be slow for large files.
import re
from io import StringIO
import pandas as pd

def multiple_replace(d, text):
    # Create a regular expression that matches any of the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, d.keys())))
    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda mo: d[mo.string[mo.start():mo.end()]], text)

# Don't call this 'dict' (it shadows the built-in), and double the
# backslashes so the keys match a literal \" in the file
d = {
    '"\\"' : '"',
    '\\""' : '"',
}

with open('file.csv') as file:
    lines = [multiple_replace(d, line) for line in file]

# each line keeps its trailing newline, so join with ''
df = pd.read_csv(StringIO(''.join(lines)))
Upvotes: 2