Calab

Reputation: 363

Python, split a string at commas, except within quotes, ignoring whitespace

I've found some solutions, but the results I am getting don't match what I'm expecting.

I want to take a string, and split it at commas, except when the commas are contained within double quotation marks. I would like to ignore whitespace. I can live with losing the double quotes in the process, but it's not necessary.

Is csv the best way to do this? Would a regex solution be better?

#!/usr/local/bin/python2.7

import csv

s = 'abc,def, ghi, "jkl, mno, pqr","stu"'

result = csv.reader(s, delimiter=',', quotechar='"')

for r in result: 
    print r

# Should display:
# abc
# def
# ghi
# jkl, mno, pqr
# stu
#
# But I get:
# ['a']
# ['b']
# ['c']
# ['', '']
# ['d']
# ['e']
# ['f']
# ['', '']
# [' ']
# ['g']
# ['h']
# ['i']
# ['', '']
# [' ']
# ['jkl, mno, pqr']
# ['', '']
# ['stu']

print r[1]  # Should be "def", but I get a "list index out of range" error.

Upvotes: 1

Views: 4511

Answers (2)

CertainPerformance

Reputation: 370759

You can use the regular expression

".+?"|[\w-]+

This matches a double quote followed by one or more characters (lazily) up to the next double quote; otherwise, it matches a run of word characters or hyphens, which excludes commas, spaces, and quotes.

https://regex101.com/r/IThYf7/1

import re
s = 'abc,def, ghi, "jkl, mno, pqr","stu"'
for r in re.findall(r'".+?"|[\w-]+', s):
    print(r)

If you want to drop the double quotes around the quoted sections, the best I could come up with uses the third-party regex module (so that \K is available):

(?:^"?|, ?"?)\K(?:(?<=").+?(?=")|[\w-]+)

https://regex101.com/r/IThYf7/3
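
If installing regex is not an option, a similar result seems achievable with the standard re module by putting the two alternatives in separate capture groups. This is a minimal sketch, assuming Python 3; it uses [^"]* in place of .+? so that an empty quoted field also matches, which is a substitution on my part rather than the pattern above.

```python
import re

s = 'abc,def, ghi, "jkl, mno, pqr","stu"'

# Group 1 captures the text inside double quotes; group 2 captures a bare token.
pattern = re.compile(r'"([^"]*)"|([\w-]+)')

# With two groups, findall returns one (quoted, bare) tuple per match.
# At most one side is non-empty, so "quoted or bare" picks the field text.
fields = [quoted or bare for quoted, bare in pattern.findall(s)]

for field in fields:
    print(field)
```

As with the original pattern, an unquoted field containing spaces would be split into several tokens, so this only fits input shaped like the example string.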

Upvotes: 3

Jan

Reputation: 43169

Besides csv, another approach is possible with the third-party regex module (pip install regex), which supports the (*SKIP)(*FAIL) verbs:

"[^"]*"(*SKIP)(*FAIL)|,\s*


This reads as follows:

"[^"]*"(*SKIP)(*FAIL) # match everything between two double quotes and "forget" about them
|                     # or
,\s*                  # match a comma and zero or more whitespace characters


In Python:

import regex as re

rx = re.compile(r'"[^"]*"(*SKIP)(*FAIL)|,\s*')
string = 'abc,def, ghi, "jkl, mno, pqr","stu"'

parts = rx.split(string)
print(parts)

This yields

['abc', 'def', 'ghi', '"jkl, mno, pqr"', '"stu"']

See a demo on regex101.com.
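
For comparison, the csv route mentioned above can also be made to work. csv.reader iterates over its argument and treats each item as one line, so passing a bare string makes it read one character per "line", which explains the output in the question. Below is a minimal sketch, assuming Python 3 and that the whole input is a single line; the extra strip() call is my own addition to trim trailing whitespace.

```python
import csv

s = 'abc,def, ghi, "jkl, mno, pqr","stu"'

# Wrap the string in a list so the reader sees one line, not one character at a time.
# skipinitialspace=True ignores whitespace that directly follows a delimiter,
# which also lets a quoted field start after ", ".
reader = csv.reader([s], delimiter=',', quotechar='"', skipinitialspace=True)

# There is only one line of input, so take the first (and only) row.
row = next(reader)

# skipinitialspace does not touch trailing whitespace, so strip each field as well.
fields = [field.strip() for field in row]

print(fields)     # ['abc', 'def', 'ghi', 'jkl, mno, pqr', 'stu']
print(fields[1])  # def
```

Note that strip() also trims leading and trailing spaces inside quoted fields; drop it if that whitespace matters.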

Upvotes: 0
