Reputation: 121
I am working on a project where I read multiple CSV files in a for loop, some files contain a semicolon(;) as a seperator while other contain a comma (,). I need to keep the headers in all CSV files so I cannot used fixed headers.
I am having trouble with one CSV file especially that contains quotes for some values, e.g.
Column 1,Column 2,Column 3,Column 4 12531,ABCDE,"12523,2",EXAMPLE 52313,ADE,12302,EXAMPLE
Using comma as a seperator doesn't work since I end up with 15 values instead of 13 leading to parse error.
When opening the file in notepad I see that for the lines that contains values with quotes, double quotes are added to said value aswell as a comma is added to the front and end of the line:
Column 1,Column 2,Column 3,Column 4 "12531,ABCDE,""12523,2"",EXAMPLE" 52313,ADE,12302,EXAMPLE
I would like to process this file like the other succesfully, while keeping the ability to use:
I tried replacing the , between quotes with a . but this does not eliminate the issue with the quote sign added to the front and end of the line, and it simply doesn't work as it turns out to be two quotes..
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
for csv in all_filenames:
file = open(csv, 'r')
content = file.read()
content = re.sub(r'(?!(([^"]*"){2})*[^"]*$),', '.', content)
combined_csv = pd.read_csv(csv, sep=";|,", engine="python")
appended_data.append(combined_csv)
appended_data = pd.concat(appended_data)
Upvotes: 3
Views: 559
Reputation: 626932
I sugges replacing chunks of more than one "
before or after a comma with a single occurrence, and then using pd.read_csv
with the quotechar='"'
argument to make sure the quoted fields end up in a single column:
content = re.sub(r'(?<![^,])"{2,}|"{2,}(?![^,])', '"', content)
#...
combined_csv = pd.read_csv(csv, sep=";|,", engine="python", quotechar='"')
Regex details:
(?<![^,])
- immediately before the current location, there must be a comma or start of string"{2,}
- two or more "
chars|
- or"{2,}
- two or more "
chars(?![^,])
- immediately after the current location, there must be a comma or end of string.Upvotes: 2