Reputation: 363
I've found some solutions, but the results I am getting don't match what I'm expecting.
I want to take a string, and split it at commas, except when the commas are contained within double quotation marks. I would like to ignore whitespace. I can live with losing the double quotes in the process, but it's not necessary.
Is csv the best way to do this? Would a regex solution be better?
#!/usr/local/bin/python2.7
import csv
s = 'abc,def, ghi, "jkl, mno, pqr","stu"'
result = csv.reader(s, delimiter=',', quotechar='"')
for r in result:
    print r
# Should display:
# abc
# def
# ghi
# jkl, mno, pqr
# stu
#
# But I get:
# ['a']
# ['b']
# ['c']
# ['', '']
# ['d']
# ['e']
# ['f']
# ['', '']
# [' ']
# ['g']
# ['h']
# ['i']
# ['', '']
# [' ']
# ['jkl, mno, pqr']
# ['', '']
# ['stu']
print r[1] # Should be "def" but I get a "list index out of range" error.
Upvotes: 1
Views: 4511
Reputation: 370759
You can use the regular expression
".+?"|[\w-]+
This matches a double quote, followed by any characters up to the next double quote; otherwise, it matches a run of word characters or hyphens (so no commas or quotes).
https://regex101.com/r/IThYf7/1
import re
s = 'abc,def, ghi, "jkl, mno, pqr","stu"'
for r in re.findall(r'".+?"|[\w-]+', s):
    print(r)
If you want to get rid of the "s around the quoted sections, the best I could figure out, using the regex module (so that \K was usable), was:
(?:^"?|, ?"?)\K(?:(?<=").+?(?=")|[\w-]+)
https://regex101.com/r/IThYf7/3
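If installing the regex module is not an option, the quotes can also be dropped with the standard re module by putting the quoted content and the bare word into separate capture groups. This is only a sketch under the same assumptions as the pattern above (no escaped quotes inside a quoted section, unquoted fields made of word characters or hyphens); the names pattern and fields are illustrative, not part of the original answer.

```python
import re

s = 'abc,def, ghi, "jkl, mno, pqr","stu"'

# Group 1 captures the text inside double quotes (quotes excluded);
# group 2 captures an unquoted run of word characters or hyphens.
pattern = re.compile(r'"([^"]*)"|([\w-]+)')

# findall returns one (group1, group2) tuple per match; exactly one of the
# two groups is non-empty for this input, so pick whichever matched.
fields = [quoted or bare for quoted, bare in pattern.findall(s)]

for field in fields:
    print(field)
```

With the sample string this prints abc, def, ghi, jkl, mno, pqr and stu on separate lines, without the surrounding quotes.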
Upvotes: 3
Reputation: 43169
Besides using csv, another nice approach is supported by the newer regex module (i.e. pip install regex):
"[^"]*"(*SKIP)(*FAIL)|,\s*
"[^"]*"(*SKIP)(*FAIL) # match everything between two double quotes and "forget" about them
| # or
,\s* # match a comma and 0+ whitespaces
In Python:
import regex as re
rx = re.compile(r'"[^"]*"(*SKIP)(*FAIL)|,\s*')
string = 'abc,def, ghi, "jkl, mno, pqr","stu"'
parts = rx.split(string)
print(parts)
This yields
['abc', 'def', 'ghi', '"jkl, mno, pqr"', '"stu"']
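For comparison, the csv route also works once the string is wrapped in a list. csv.reader iterates over its argument and treats each item as one line, so a bare string is consumed one character at a time, which is what produces the single-character rows in the question. The following is a minimal sketch, assuming Python 3 and a single-line input; skipinitialspace=True tells the reader to ignore whitespace directly after a delimiter, so a quote that follows ", " is still treated as a quote character.

```python
import csv

s = 'abc,def, ghi, "jkl, mno, pqr","stu"'

# csv.reader expects an iterable of lines, so wrap the single string in a list.
# skipinitialspace=True drops whitespace that directly follows a delimiter.
reader = csv.reader([s], delimiter=',', quotechar='"', skipinitialspace=True)

# The input has only one line, so take the first (and only) parsed row.
fields = next(reader)

for field in fields:
    print(field)
```

Here fields[1] is 'def', and the quoted sections come back without their double quotes.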
Upvotes: 0