Stijn
Stijn

Reputation: 121

How to read CSV file with pandas containing quotes and using multiple seperators

I am working on a project where I read multiple CSV files in a for loop, some files contain a semicolon(;) as a seperator while other contain a comma (,). I need to keep the headers in all CSV files so I cannot used fixed headers.

I am having trouble with one CSV file especially that contains quotes for some values, e.g.

Column 1,Column 2,Column 3,Column 4 12531,ABCDE,"12523,2",EXAMPLE 52313,ADE,12302,EXAMPLE

Using comma as a seperator doesn't work since I end up with 15 values instead of 13 leading to parse error.

When opening the file in notepad I see that for the lines that contains values with quotes, double quotes are added to said value aswell as a comma is added to the front and end of the line:

Column 1,Column 2,Column 3,Column 4 "12531,ABCDE,""12523,2"",EXAMPLE" 52313,ADE,12302,EXAMPLE

I would like to process this file like the other succesfully, while keeping the ability to use:

I tried replacing the , between quotes with a . but this does not eliminate the issue with the quote sign added to the front and end of the line, and it simply doesn't work as it turns out to be two quotes..

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

for csv in all_filenames:
    file = open(csv, 'r')
    content = file.read()
    content = re.sub(r'(?!(([^"]*"){2})*[^"]*$),', '.', content)
    combined_csv = pd.read_csv(csv, sep=";|,", engine="python")
    appended_data.append(combined_csv)

appended_data = pd.concat(appended_data)

Upvotes: 3

Views: 559

Answers (1)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626932

I sugges replacing chunks of more than one " before or after a comma with a single occurrence, and then using pd.read_csv with the quotechar='"' argument to make sure the quoted fields end up in a single column:

content = re.sub(r'(?<![^,])"{2,}|"{2,}(?![^,])', '"', content)
#...
combined_csv = pd.read_csv(csv, sep=";|,", engine="python", quotechar='"')

Regex details:

  • (?<![^,]) - immediately before the current location, there must be a comma or start of string
  • "{2,} - two or more " chars
  • | - or
  • "{2,} - two or more " chars
  • (?![^,]) - immediately after the current location, there must be a comma or end of string.

Upvotes: 2

Related Questions