CopyOfA

Reputation: 851

Dealing with delimiters in column entries pandas

I have a large CSV file, and a few entries in one of the columns contain a comma within the entry itself. For example, one entry might read "\"a,bcde\"", another might read "\"abcde\"", and another will read like <unknown>. So, when the entry is known (sometimes it is also just an empty string, e.g., ""), the pattern is clear: it starts with "\" and ends with \"", and everything between these markers is the entry I want.

When I try to pull this into pandas, I get an error on the row containing the comma (Error tokenizing data. C error: Expected 32 fields in line 24193, saw 33). I'm not sure what to do with this. Any suggestions are welcome!
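Since the backslash-escaped quotes appear literally in the file, one thing worth trying before any preprocessing is `read_csv`'s `escapechar` parameter, which tells the tokenizer to treat a character following `\` inside a quoted field as literal. This is only a sketch against a tiny inline sample (the column names `c1`/`c2` and the sample row are made up, not from the real file):

```python
from io import StringIO
import pandas as pd

# Inline stand-in for the real file: the second field contains
# a comma and the literal \" escape sequences described above
csv_text = 'c1,c2\nx,"\\"a,bcde\\""\n'

# escapechar='\\' makes the parser treat \" as a literal quote
# inside the quoted field instead of a field terminator
df = pd.read_csv(StringIO(csv_text), escapechar='\\')
```

If this parses, the inner quotes survive as part of the value and can be removed afterwards with something like `df['c2'].str.strip('"')`.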

Upvotes: 1

Views: 73

Answers (2)

CopyOfA

Reputation: 851

Here is what finally worked for me, given that some of the file was not valid UTF-8.

import re
from io import StringIO
import pandas as pd

def multiple_replace(d, text):
    # Build one regex that matches any of the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, d.keys())))

    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda mo: d[mo.group(0)], text)

# The keys are the two-character sequences "\ and \" as they
# appear literally in the file (hence the doubled backslashes)
d = {
    '"\\"' : '"',
    '\\""' : '"',
    }


with open("file.csv", 'r', errors='ignore') as f: #ignore errors to deal with non-utf-8 encoding
    lines = [multiple_replace(d, l) for l in f]

df = pd.read_csv(StringIO(''.join(lines)))
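To see what the replacement actually does to a problem row, here is a self-contained run of `multiple_replace` on a made-up sample line (the surrounding fields `x` and `y` are invented for illustration):

```python
import re

def multiple_replace(d, text):
    # Build one regex that matches any of the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, d.keys())))
    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda mo: d[mo.group(0)], text)

# Collapse the "\ and \" markers into plain quotes
d = {'"\\"': '"', '\\""': '"'}

# A sample row whose middle field contains the embedded comma
line = 'x,"\\"a,bcde\\"",y'
print(multiple_replace(d, line))  # x,"a,bcde",y
```

After the rewrite, the embedded comma sits inside ordinary double quotes, which is exactly the quoting convention `read_csv` handles by default.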

Upvotes: 0

Paul Brennan

Reputation: 2706

This is a mash-up of two methods: the first is a multiple replace, and the second reads the CSV into memory, does the replacements you need, and then hands the result to pandas. It will be slow for large files.

import re
from io import StringIO
import pandas as pd

def multiple_replace(d, text):
    # Build one regex that matches any of the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, d.keys())))

    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda mo: d[mo.group(0)], text)

# Note the doubled backslashes: the keys must be the two-character
# sequences "\ and \" as they appear literally in the file.
# (Also avoid naming the variable `dict`, which shadows the builtin.)
d = {
    '"\\"' : '"',
    '\\""' : '"',
}

with open('file.csv') as file:
    lines = [multiple_replace(d, line) for line in file]

# Each line keeps its trailing newline, so join with '' rather than '\n'
df = pd.read_csv(StringIO(''.join(lines)))

Upvotes: 2
