ghostcoder

Reputation: 493

Split a comma-, space-, or semicolon-separated string using regex

I use the regex [,;\s]+ to split a comma, space, or semicolon separated string. This works fine if the string doesn't have a comma at the end:

>>> import re
>>> p = re.compile(r'[,;\s]+')
>>> mystring='a,,b,c'
>>> p.split(mystring)
['a', 'b', 'c']

When the string has a comma at the end:

>>> mystring='a,,b,c,'
>>> p.split(mystring)
['a', 'b', 'c', '']

I want the output in this case to be ['a', 'b', 'c'].

Any suggestions on the regex?

Upvotes: 4

Views: 20291

Answers (3)

inspectorG4dget

Reputation: 113945

Here's something very low tech that should still work:

mystring='a,,b,c'
for delim in ',;':
    mystring = mystring.replace(delim, ' ')
results = mystring.split()
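
As a minimal sketch, the same loop can be wrapped in a function and checked against the trailing-comma input from the question. The name `split_low_tech` is made up for illustration; the approach relies on `str.split()` with no arguments discarding leading, trailing, and repeated whitespace.

```python
def split_low_tech(text, delims=',;'):
    """Replace each delimiter with a space, then split on whitespace."""
    for delim in delims:
        text = text.replace(delim, ' ')
    # split() with no arguments never yields empty strings
    return text.split()


print(split_low_tech('a,,b,c,'))
```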

PS: While regexes are very useful, I would strongly suggest thinking twice about whether they are the right tool for the job here. I'm not sure what the exact runtime of a compiled regex is (I'm thinking at most O(n^2)), but it cannot be faster than O(n), which is the runtime of str.replace. So unless there is another reason you need a regex, you should be set with this solution.

Upvotes: 8

mathematical.coffee

Reputation: 56915

Well, the split technically did work. In 'a,,b,c', it splits on ',,' and ',', leaving "a", "b", and "c". In 'a,,b,c,', it splits on ',,', ',', and the last ',' (because they all match the regex!). The strings "around" those delimiters are "a", "b", "c", and "" (between the last comma and the end of the string).
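
To make that concrete, here is a small sketch showing that the empty string appears at whichever end of the input has a delimiter. The variable names and the extra sample inputs are only illustrative.

```python
import re

p = re.compile(r'[,;\s]+')

trailing = p.split('a,,b,c,')   # delimiter at the end
leading = p.split(',a,,b,c')    # delimiter at the start
both = p.split(';a, b;c ')      # delimiters at both ends

print(trailing)  # ['a', 'b', 'c', '']
print(leading)   # ['', 'a', 'b', 'c']
print(both)      # ['', 'a', 'b', 'c', '']
```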

There are a few ways you can circumvent this.

  • The empty string will only occur if there's a delimiter at the start or end of the string, so trim off any of these [,;\s] prior to splitting using str.strip:

    p.split(mystring.strip(',; \t\r\n'))
    
  • Remove the empty string after splitting, using any method you please:

    res = p.split(mystring)
    [r for r in res if r != '']
    # another option (in Python 3, filter returns an iterator, hence list())
    list(filter(None, res))
    
  • Even better, since you know you'll only get the empty string as either the first or last part of the split string (e.g. ,a,b,c or a,b,c,), don't iterate through the entire split:

    res = p.split(mystring)
    # this one relies on booleans acting as integers (True == 1, False == 0):
    # if res[0] is '' the slice is 1:X, otherwise it is 0:X,
    #  where X is len(res) if res[-1] is not '', and len(res)-1 otherwise.
    res[res[0] == '':len(res) - (res[-1] == '')]
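
The options above could be packaged roughly as follows. This is only a sketch: `split_fields` and `_DELIMS` are hypothetical names, and it uses the filtering variant because that also covers an empty or delimiter-only input without special cases.

```python
import re

_DELIMS = re.compile(r'[,;\s]+')


def split_fields(text):
    """Split on runs of commas, semicolons, or whitespace.

    Empty strings produced by leading or trailing delimiters are dropped.
    """
    return [part for part in _DELIMS.split(text) if part]


print(split_fields('a,,b,c,'))
```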
    

Upvotes: 4

Qtax

Reputation: 33908

Try:

mystring = 'a,,b,c,'  # avoid naming a variable str, which shadows the built-in
re.findall(r'[^,;\s]+', mystring)
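
This inverts the problem: rather than matching the delimiters, the pattern matches runs of non-delimiter characters, so an empty field can never be produced. A quick sketch under that reading, with `find_fields` as a made-up wrapper name:

```python
import re


def find_fields(text):
    """Return every maximal run of characters that are not delimiters."""
    return re.findall(r'[^,;\s]+', text)


print(find_fields('a,,b,c,'))
```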

Upvotes: 9
