Reputation: 359
Python 2.7 code:
cStr = '"aaaa","bbbb","ccc,ddd"'
newStr = cStr.split(',')
print newStr # -> ['"aaaa"', '"bbbb"', '"ccc', 'ddd"']
But I want this result:
result = ['"aaaa"', '"bbbb"', '"ccc,ddd"']
Upvotes: 31
Views: 44596
Reputation: 143
I liked Mark de Haan's solution, but I had to rework it: it removed the quote characters (although they were needed), so the assertion in his original example failed. I also added two parameters to support different separators and quote characters.
def tokenize(string, separator=',', quote='"'):
    """
    Split a comma separated string into a List of strings.
    Separator characters inside the quotes are ignored.
    :param string: A string to be split into chunks
    :param separator: A separator character
    :param quote: A character to define beginning and end of the quoted string
    :return: A list of strings, one element for every chunk
    """
    comma_separated_list = []
    chunk = ''
    in_quotes = False
    for character in string:
        if character == separator and not in_quotes:
            # An unquoted separator ends the current chunk.
            comma_separated_list.append(chunk)
            chunk = ''
        else:
            chunk += character
            if character == quote:
                in_quotes = not in_quotes
    comma_separated_list.append(chunk)
    return comma_separated_list
And the tests...
def test_tokenizer():
    string = '"aaaa","bbbb","ccc,ddd"'
    expected = ['"aaaa"', '"bbbb"', '"ccc,ddd"']
    actual = tokenize(string)
    assert expected == actual
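To illustrate the two extra parameters, here is a small sketch that repeats the function in condensed form and calls it with a semicolon separator and single quotes. The sample strings are made up for the example and are not from the question.

```python
def tokenize(string, separator=',', quote='"'):
    """Split on separator, ignoring separators inside quoted runs."""
    chunks = []
    chunk = ''
    in_quotes = False
    for character in string:
        if character == separator and not in_quotes:
            chunks.append(chunk)
            chunk = ''
        else:
            chunk += character
            if character == quote:
                in_quotes = not in_quotes
    chunks.append(chunk)
    return chunks


# Made-up inputs, only to exercise the optional parameters.
semicolon_fields = tokenize("'a;b';'c';d", separator=';', quote="'")
default_fields = tokenize('"aaaa","bbbb","ccc,ddd"')
print(semicolon_fields)
print(default_fields)
```

Note that the quote characters stay in the output, and an unquoted field such as d passes through unchanged.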
Upvotes: 4
Reputation: 33
It is always better to use existing libraries when you can, but I was struggling to get my specific use case to work with any of the above answers, so I wrote my own for Python 3.9 (it will probably work back to 3.6, and removing the type hints should make it compatible with 2.x).
from typing import List

def separate(string) -> List[str]:
    """
    Split a comma separated string into a List of strings.
    Resulting list elements are trimmed of double quotes.
    Commas inside double quotes are ignored.
    :param string: A string to be split into chunks
    :return: A list of strings, one element for every chunk
    """
    comma_separated_list: List[str] = []
    chunk: str = ''
    in_quotes: bool = False
    for character in string:
        if character == ',' and not in_quotes:
            comma_separated_list.append(chunk)
            chunk = ''
        elif character == '"':
            # Quote characters toggle the state and are not copied to the chunk.
            in_quotes = not in_quotes
        else:
            chunk += character
    comma_separated_list.append(chunk)
    return comma_separated_list
And the tests...
def test_separator():
    string = '"aaaa","bbbb","ccc,ddd"'
    # separate() strips the double quotes, as its docstring says.
    expected = ['aaaa', 'bbbb', 'ccc,ddd']
    actual = separate(string)
    assert expected == actual
Upvotes: 2
Reputation: 623
Using a regex, try this:
import re
string = '"aaaa","bbbb","ccc,ddd"'
COMMA_MATCHER = re.compile(r",(?=(?:[^\"']*[\"'][^\"']*[\"'])*[^\"']*$)")
split_result = COMMA_MATCHER.split(string)
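As far as I can tell, the pattern splits only on commas that are followed by an even number of quote characters up to the end of the string, which means the comma sits outside any quoted run. A small sketch, with the second sample string made up for illustration:

```python
import re

# A comma counts as a delimiter only if the rest of the string
# contains an even number of quote characters (single or double).
COMMA_MATCHER = re.compile(r",(?=(?:[^\"']*[\"'][^\"']*[\"'])*[^\"']*$)")

quoted = COMMA_MATCHER.split('"aaaa","bbbb","ccc,ddd"')
mixed = COMMA_MATCHER.split("plain,'x,y',\"p,q\"")
print(quoted)
print(mixed)
```

Because the pattern treats single and double quotes interchangeably, a lone apostrophe inside a field would probably throw the count off, so check it against your real data.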
Upvotes: 18
Reputation: 1089
tssplit is not in the standard library, so you have to install it via pip, but it is another option:
In [3]: from tssplit import tssplit
In [4]: tssplit('"aaaa","bbbb","ccc,ddd"', quote='"', delimiter=',')
Out[4]: ['aaaa', 'bbbb', 'ccc,ddd']
Upvotes: 0
Reputation: 5156
Try using the csv module.
import csv
cStr = '"aaaa","bbbb","ccc,ddd"'
newStr = ['"{}"'.format(x) for x in next(csv.reader([cStr], delimiter=',', quotechar='"'))]
print(newStr)
See also the question "Python parse CSV ignoring comma with double-quotes".
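For readers on Python 3, here is a sketch of the same idea wrapped in a function. The helper name split_quoted and its keep_quotes flag are hypothetical; csv.reader is the standard library call. The second call uses a made-up input to show that csv also understands a doubled quote inside a field, which plain splitting does not.

```python
import csv


def split_quoted(line, keep_quotes=True):
    """Parse one CSV line; optionally wrap each field in double quotes again."""
    fields = next(csv.reader([line], delimiter=',', quotechar='"'))
    if keep_quotes:
        # Every field gets quotes here, even one that was unquoted in the input.
        return ['"{}"'.format(field) for field in fields]
    return fields


print(split_quoted('"aaaa","bbbb","ccc,ddd"'))
print(split_quoted('"say ""hi""",x', keep_quotes=False))
```

If you only need the field contents, passing keep_quotes=False skips the re-quoting step entirely.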
Upvotes: 26
Reputation: 195
You can first split the string by ", then filter out '' and ',', and finally re-add the quotes. This may be the simplest way:
['"%s"' % s for s in cStr.split('"') if s and s != ',']
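A quick check of this approach, including one case where it falls short. The function name split_on_quotes is hypothetical, and the second input is made up: it contains an empty quoted field, which the filter drops because the empty string is falsy.

```python
def split_on_quotes(text):
    """Split on double quotes, drop empty pieces and bare commas, re-quote."""
    return ['"%s"' % s for s in text.split('"') if s and s != ',']


all_quoted = split_on_quotes('"aaaa","bbbb","ccc,ddd"')
with_empty = split_on_quotes('"a","","b"')  # the empty field is lost
print(all_quoted)
print(with_empty)
```

So this works for the string in the question, but it assumes every field is quoted and non-empty.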
Upvotes: 1
Reputation: 63709
pyparsing has a built-in expression, commaSeparatedList:
cStr = '"aaaa","bbbb","ccc,ddd"'
import pyparsing as pp
print(pp.commaSeparatedList.parseString(cStr).asList())
prints:
['"aaaa"', '"bbbb"', '"ccc,ddd"']
You can also add a parse-time action to strip those double-quotes (since you probably just want the content, not the quotation marks too):
csv_line = pp.commaSeparatedList.copy().addParseAction(pp.tokenMap(lambda s: s.strip('"')))
print(csv_line.parseString(cStr).asList())
gives:
['aaaa', 'bbbb', 'ccc,ddd']
Upvotes: 12
Reputation: 502
A regex works better in this case.
re.findall('".*?"', cStr)
returns exactly what you need.
The asterisk is a greedy quantifier: if you used '".*"', it would return the maximal match, i.e. everything between the very first and the very last double quote. The question mark makes it non-greedy, so '".*?"' returns the smallest possible match.
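The difference between the greedy and non-greedy forms can be seen side by side in this short sketch, which uses the string from the question (the variable names greedy and lazy are just for illustration):

```python
import re

cStr = '"aaaa","bbbb","ccc,ddd"'

greedy = re.findall(r'".*"', cStr)   # one match spanning the whole string
lazy = re.findall(r'".*?"', cStr)    # one match per quoted field
print(greedy)
print(lazy)
```

Keep in mind that findall only collects the quoted pieces, so any unquoted field between commas would be skipped.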
Upvotes: 3
Reputation: 8202
You need a parser. You can build your own, or you may be able to press one of the library ones into service. In this case, json could be (ab)used.
import json
cStr = '"aaaa","bbbb","ccc,ddd"'
jstr = '[' + cStr + ']'
result = json.loads( jstr) # ['aaaa', 'bbbb', 'ccc,ddd']
result = [ '"'+r+'"' for r in result ] # ['"aaaa"', '"bbbb"', '"ccc,ddd"']
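The trick works because each quoted field happens to be a valid JSON string. A minimal sketch of where that assumption breaks, using a hypothetical helper split_via_json and a made-up second input with an unquoted field:

```python
import json


def split_via_json(text):
    """Wrap the text in brackets and parse it as a JSON array."""
    try:
        return json.loads('[' + text + ']')
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None  # the input was not a list of JSON-style values


print(split_via_json('"aaaa","bbbb","ccc,ddd"'))
print(split_via_json('"a",b'))
```

Returning None on failure is just one possible choice here; re-raising or falling back to another parser would work too.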
Upvotes: 0
Reputation: 92854
A solution using the re.split() function:
import re
cStr = '"aaaa","bbbb","ccc,ddd"'
newStr = re.split(r',(?=")', cStr)
print(newStr)
The output:
['"aaaa"', '"bbbb"', '"ccc,ddd"']
,(?=") is a positive lookahead assertion: it ensures that the delimiter , is followed by a double quote ".
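Since the lookahead only checks the character right after the comma, the approach relies on every field starting with a double quote. A short sketch, where the second input is made up to show an unquoted field:

```python
import re

all_quoted = re.split(r',(?=")', '"aaaa","bbbb","ccc,ddd"')
# Made-up input with an unquoted field: the comma before b is not
# followed by a quote, so "a" and b stay glued together.
partly_quoted = re.split(r',(?=")', '"a",b,"c,d"')
print(all_quoted)
print(partly_quoted)
```

For input like the string in the question, where every field is quoted, this is probably the shortest solution.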
Upvotes: 35