Jongpyo Jeon

Reputation: 359

Split by comma, but ignore commas inside quotes

Python 2.7 code:

cStr = '"aaaa","bbbb","ccc,ddd"' 
newStr = cStr.split(',')
print newStr  # -> ['"aaaa"','"bbbb"','"ccc','ddd"' ]

But I want this result:

result = ['"aaaa"','"bbbb"','"ccc,ddd"']

Upvotes: 31

Views: 44596

Answers (10)

trodevel

Reputation: 143

I liked Mark de Haan's solution, but I had to rework it: it removed the quote characters (although they were needed), so an assertion in his example failed. I also added two parameters to support different separators and quote characters.

def tokenize(string, separator=',', quote='"'):
    """
    Split a comma separated string into a List of strings.

    Separator characters inside the quotes are ignored.

    :param string: A string to be split into chunks
    :param separator: A separator character
    :param quote: A character to define beginning and end of the quoted string
    :return: A list of strings, one element for every chunk
    """
    comma_separated_list = []

    chunk = ''
    in_quotes = False

    for character in string:
        if character == separator and not in_quotes:
            comma_separated_list.append(chunk)
            chunk = ''

        else:
            chunk += character
            if character == quote:
                in_quotes = not in_quotes

    comma_separated_list.append(chunk)

    return comma_separated_list

And the tests...

def test_tokenizer():
    string = '"aaaa","bbbb","ccc,ddd"' 

    expected = ['"aaaa"', '"bbbb"', '"ccc,ddd"']
    actual = tokenize(string)

    assert expected == actual
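
As a rough usage sketch for the two extra parameters, the snippet below repeats the function in condensed form so that it runs on its own. The semicolon-separated sample string is made up for illustration.

```python
def tokenize(string, separator=',', quote='"'):
    """Condensed copy of the function above, repeated so the example is runnable."""
    chunks = []
    chunk = ''
    in_quotes = False
    for character in string:
        if character == separator and not in_quotes:
            # Separator outside quotes: close the current chunk.
            chunks.append(chunk)
            chunk = ''
        else:
            chunk += character
            if character == quote:
                # Quote characters are kept in the chunk and toggle the state.
                in_quotes = not in_quotes
    chunks.append(chunk)
    return chunks


# Hypothetical input: semicolon-separated, with single quotes around one field.
custom = tokenize("'a;b';c", separator=';', quote="'")
print(custom)  # ["'a;b'", 'c']

# The defaults still handle the string from the question.
default = tokenize('"aaaa","bbbb","ccc,ddd"')
print(default)  # ['"aaaa"', '"bbbb"', '"ccc,ddd"']
```

Note that this sketch, like the original, does not treat a doubled quote character as an escaped quote.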

Upvotes: 4

Mark de Haan

Reputation: 33

It is always better to use existing libraries when you can, but I was struggling to get my specific use case to work with all the above answers, so I wrote my own for Python 3.9 (it will probably work back to 3.6, and removing the type hints should get you to 2.x compatibility).

from typing import List


def separate(string) -> List[str]:
    """
    Split a comma separated string into a List of strings.

    Resulting list elements are trimmed of double quotes.
    Commas inside double quotes are ignored.

    :param string: A string to be split into chunks
    :return: A list of strings, one element for every chunk
    """
    comma_separated_list: List[str] = []

    chunk: str = ''
    in_quotes: bool = False

    for character in string:
        if character == ',' and not in_quotes:
            comma_separated_list.append(chunk)
            chunk = ''

        elif character == '"':
            in_quotes = not in_quotes

        else:
            chunk += character

    comma_separated_list.append(chunk)
    return comma_separated_list

And the tests...

def test_separator():
    string = '"aaaa","bbbb","ccc,ddd"' 

    # separate() strips the double quotes, so the expected chunks are unquoted
    expected = ['aaaa', 'bbbb', 'ccc,ddd']
    actual = separate(string)

    assert expected == actual

Upvotes: 2

Gosha null

Reputation: 623

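Try this regex. The lookahead accepts a comma only when an even number of quote characters follows it, i.e. when the comma is outside any quoted section:

import re

string = '"aaaa","bbbb","ccc,ddd"'
COMMA_MATCHER = re.compile(r",(?=(?:[^\"']*[\"'][^\"']*[\"'])*[^\"']*$)")
split_result = COMMA_MATCHER.split(string)  # ['"aaaa"', '"bbbb"', '"ccc,ddd"']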
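
To see how this pattern behaves on a few inputs, here is a small sketch. The sample strings other than the one from the question are made up, and the last one illustrates a likely pitfall: because the pattern also counts single quotes, a lone apostrophe after a comma prevents the split.

```python
import re

# Same pattern as above: a comma followed by an even number of quote characters.
COMMA_MATCHER = re.compile(r",(?=(?:[^\"']*[\"'][^\"']*[\"'])*[^\"']*$)")

# String from the question: every field is quoted.
all_quoted = COMMA_MATCHER.split('"aaaa","bbbb","ccc,ddd"')
print(all_quoted)  # ['"aaaa"', '"bbbb"', '"ccc,ddd"']

# Hypothetical input mixing unquoted and quoted fields.
mixed = COMMA_MATCHER.split('a,b,"c,d"')
print(mixed)  # ['a', 'b', '"c,d"']

# Hypothetical input with an apostrophe: one quote character follows the
# comma, so the lookahead fails and nothing is split.
apostrophe = COMMA_MATCHER.split("a,it's")
print(apostrophe)  # ["a,it's"]
```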


Upvotes: 18

Mikhail Zakharov

Reputation: 1089

tssplit is not a standard-library module, so you have to install it via pip, but it is another option:

In [3]: from tssplit import tssplit
In [4]: tssplit('"aaaa","bbbb","ccc,ddd"', quote='"', delimiter=',')                                                            
Out[4]: ['aaaa', 'bbbb', 'ccc,ddd']

Upvotes: 0

ghchoi

Reputation: 5156

Try the csv module:

import csv
cStr = '"aaaa","bbbb","ccc,ddd"'
newStr = [ '"{}"'.format(x) for x in list(csv.reader([cStr], delimiter=',', quotechar='"'))[0] ]

print newStr

See also: Python parse CSV ignoring comma with double-quotes
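
For readers on Python 3, the same idea might be sketched as below. The helper name split_csv_line and its keep_quotes flag are made up for this illustration and are not part of the csv module.

```python
import csv


def split_csv_line(line, keep_quotes=False):
    """Split one CSV-formatted line into fields (hypothetical helper)."""
    # csv.reader accepts any iterable of lines, so wrap the string in a list.
    fields = next(csv.reader([line], delimiter=',', quotechar='"'))
    if keep_quotes:
        # Re-wrap every field in double quotes, including fields that were
        # not quoted in the input.
        return ['"{}"'.format(field) for field in fields]
    return fields


c_str = '"aaaa","bbbb","ccc,ddd"'
plain = split_csv_line(c_str)
quoted = split_csv_line(c_str, keep_quotes=True)

print(plain)   # ['aaaa', 'bbbb', 'ccc,ddd']
print(quoted)  # ['"aaaa"', '"bbbb"', '"ccc,ddd"']
```

One reason to prefer csv over a hand-written splitter is that it also understands doubled quotes inside a quoted field, so '"say ""hi""",x' yields the fields 'say "hi"' and 'x'.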

Upvotes: 26

jerliol

Reputation: 195

You can first split the string on ", then filter out the empty strings and the lone ',' pieces, and finally re-add the quotes. This may be the simplest way:

['"%s"' % s for s in cStr.split('"') if s and s != ',']

Upvotes: 1

PaulMcG

Reputation: 63709

pyparsing has a built-in expression, commaSeparatedList:

cStr = '"aaaa","bbbb","ccc,ddd"' 
import pyparsing as pp
print(pp.commaSeparatedList.parseString(cStr).asList())

prints:

['"aaaa"', '"bbbb"', '"ccc,ddd"']

You can also add a parse-time action to strip those double-quotes (since you probably just want the content, not the quotation marks too):

csv_line = pp.commaSeparatedList.copy().addParseAction(pp.tokenMap(lambda s: s.strip('"')))
print(csv_line.parseString(cStr).asList())

gives:

['aaaa', 'bbbb', 'ccc,ddd']

Upvotes: 12

Juraj Bezručka

Reputation: 502

A regex works better in this case: re.findall('".*?"', cStr) returns exactly what you need.

The asterisk is a greedy quantifier: with '".*"' you would get the longest possible match, i.e. everything between the very first and the very last double quote. The question mark makes it non-greedy, so '".*?"' returns the shortest possible match.
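
A small sketch of the difference, using the string from the question (the variable names are chosen only for this example):

```python
import re

c_str = '"aaaa","bbbb","ccc,ddd"'

# Greedy: .* runs to the last double quote, so the whole line is one match.
greedy = re.findall(r'".*"', c_str)

# Non-greedy: .*? stops at the nearest closing quote, one match per field.
lazy = re.findall(r'".*?"', c_str)

print(greedy)  # ['"aaaa","bbbb","ccc,ddd"']
print(lazy)    # ['"aaaa"', '"bbbb"', '"ccc,ddd"']
```

Keep in mind that this approach only collects quoted fields; anything outside double quotes is skipped.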

Upvotes: 3

nigel222

Reputation: 8202

You need a parser. You can build your own, or you may be able to press one of the library ones into service. In this case, json could be (ab)used.

import json

cStr = '"aaaa","bbbb","ccc,ddd"' 
jstr = '[' + cStr + ']'
result = json.loads(jstr)                  # ['aaaa', 'bbbb', 'ccc,ddd']
result = ['"' + r + '"' for r in result]   # ['"aaaa"', '"bbbb"', '"ccc,ddd"']

Upvotes: 0

RomanPerekhrest

Reputation: 92854

A solution using the re.split() function:

import re

cStr = '"aaaa","bbbb","ccc,ddd"'
newStr = re.split(r',(?=")', cStr)

print newStr

The output:

['"aaaa"', '"bbbb"', '"ccc,ddd"']

,(?=") - a positive lookahead assertion; it ensures that the delimiter , is followed by a double quote "
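
As an illustration of what the lookahead does and does not cover, here is a short Python 3 sketch. The helper name split_before_quote and the second sample string are made up for this example.

```python
import re


def split_before_quote(line):
    # Split only at commas that are immediately followed by a double quote.
    return re.split(r',(?=")', line)


# String from the question: every field is quoted, so every separating
# comma is followed by a quote.
all_quoted = split_before_quote('"aaaa","bbbb","ccc,ddd"')
print(all_quoted)  # ['"aaaa"', '"bbbb"', '"ccc,ddd"']

# Hypothetical input with unquoted fields: the comma between a and b is not
# followed by a quote, so those two fields stay together.
mixed = split_before_quote('a,b,"c,d"')
print(mixed)  # ['a,b', '"c,d"']
```

So the pattern fits input where every field is quoted, as in the question, but not mixed input.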

Upvotes: 35
