Reputation: 493
I use the regex [,;\s]+ to split a comma, space, or semicolon separated string. This works fine if the string doesn't have a comma at the end:
>>> import re
>>> p=re.compile(r'[,;\s]+')
>>> mystring='a,,b,c'
>>> p.split(mystring)
['a', 'b', 'c']
When the string has a comma at the end:
>>> mystring='a,,b,c,'
>>> p.split(mystring)
['a', 'b', 'c', '']
I want the output in this case to be ['a', 'b', 'c'].
Any suggestions on the regex?
Upvotes: 4
Views: 20291
Reputation: 113945
Here's something very low tech that should still work:
mystring='a,,b,c'
for delim in ',;':
    mystring = mystring.replace(delim, ' ')
results = mystring.split()
PS:
While regexes are very useful, I would strongly suggest thinking twice about whether a regex is the right tool for the job here. I'm not sure what the exact runtime of a compiled regex is (I'm thinking at most O(n^2)), but it is definitely not faster than O(n), which is the runtime of str.replace. So unless there is a different reason you need a regex, you should be set with this solution.
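The replace-then-split idea above can be wrapped in a small helper. This is only a sketch: the name split_low_tech and its delims parameter are made up for illustration and are not part of the answer. It relies on str.split() with no arguments, which splits on runs of whitespace and drops empty strings, so a trailing delimiter leaves no empty item behind.

```python
def split_low_tech(text, delims=",;"):
    """Split text on whitespace and on any character in delims.

    Hypothetical helper: the name and signature are assumptions
    made for this example.
    """
    for delim in delims:
        # Turn every delimiter into a space so a single split() handles them all.
        text = text.replace(delim, " ")
    # split() with no arguments collapses whitespace runs and never
    # returns empty strings, even for leading or trailing delimiters.
    return text.split()


print(split_low_tech("a,,b,c,"))      # ['a', 'b', 'c']
print(split_low_tech(" ;a, b;;c ;"))  # ['a', 'b', 'c']
```

Each replace call makes one pass over the string, so the cost grows with the number of delimiter characters; for two or three delimiters that is unlikely to matter.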
Upvotes: 8
Reputation: 56915
Well, the split technically did work. In 'a,,b,c', it splits on ',,' and ',', leaving "a", "b", and "c". In 'a,,b,c,', it splits on ',,', ',' and the last ',' (because they all match the regex!). The strings "around" those delimiters are "a", "b", "c", and "" (between the last comma and the end of the string).
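To see this for yourself, here is a rough sketch that lists the delimiter matches next to the pieces the split returns. The helper name show_split is invented for this example; the pattern is the one from the question.

```python
import re

p = re.compile(r"[,;\s]+")


def show_split(text):
    """Return (delimiter matches, pieces) for text.

    Hypothetical helper used only to illustrate the explanation above.
    """
    delimiters = [m.group() for m in p.finditer(text)]
    pieces = p.split(text)
    return delimiters, pieces


print(show_split("a,,b,c"))   # ([',,', ','], ['a', 'b', 'c'])
print(show_split("a,,b,c,"))  # ([',,', ',', ','], ['a', 'b', 'c', ''])
```

With n delimiter matches the split always returns n + 1 pieces, which is why a delimiter at the very end produces a final empty string.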
There are a few ways you can circumvent this.
The empty string will only occur if there's a delimiter at the start or end of the string, so trim off any of these [,;\s] characters prior to splitting, using str.strip:
p.split(mystring.strip(',; \t\r\n'))
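A slightly fuller sketch of the strip-first approach, under two assumptions that are mine and not the answer's: the input only contains ASCII whitespace, and a string made up entirely of delimiters should give an empty list. The names STRIP_CHARS and split_stripped are hypothetical.

```python
import re
import string

p = re.compile(r"[,;\s]+")

# Assumption: inputs use ASCII whitespace only. string.whitespace covers
# space, \t, \n, \r, \x0b and \x0c; on a str pattern \s can also match
# other Unicode whitespace, which this set does not include.
STRIP_CHARS = ",;" + string.whitespace


def split_stripped(text):
    """Strip delimiters from both ends, then split on the regex."""
    core = text.strip(STRIP_CHARS)
    # An input made only of delimiters strips down to '', and
    # p.split('') returns [''], so handle that case explicitly.
    return p.split(core) if core else []


print(split_stripped(",a,,b,c,"))  # ['a', 'b', 'c']
print(split_stripped(" ,; "))      # []
```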
Remove the empty string after splitting, using any method you please:
res = p.split(mystring)
[r for r in res if r != '']
# another option (in Python 3, filter returns an iterator, so wrap it in list)
list(filter(None, res))
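If dropping empty strings afterwards feels roundabout, one alternative that the answer does not mention is to match the items instead of the delimiters. This is a different technique (re.findall with the negated character class), shown here only as a sketch; the names token and find_items are invented for the example.

```python
import re

# Negation of the question's class: runs of characters that are
# NOT a comma, semicolon, or whitespace.
token = re.compile(r"[^,;\s]+")


def find_items(text):
    """Return every run of non-delimiter characters in text."""
    # findall only returns non-empty matches for this pattern,
    # so there is nothing to filter out afterwards.
    return token.findall(text)


print(find_items("a,,b,c,"))  # ['a', 'b', 'c']
print(find_items(",;"))       # []
```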
Even better, since you know you'll only get the empty string as either the first or last part of the split string (e.g. ',a,b,c' or 'a,b,c,'), don't iterate through the entire split:
res = p.split(mystring)
# this one relies on coercing logical to numbers:
# if res[0] is '' it'll be 1:X, otherwise it'll be 0:X,
# where X is len(res) if res[-1] is not '', and len(res)-1 otherwise.
res[ res[0]=='':(len(res)-(res[-1]==''))]
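The slice above is compact but easy to misread, so here is the same idea spelled out with explicit indices. The function name split_trimmed is hypothetical; the behaviour is meant to match the one-liner, including for inputs that contain nothing but delimiters.

```python
import re

p = re.compile(r"[,;\s]+")


def split_trimmed(text):
    """Split on the regex and drop an empty first or last piece."""
    res = p.split(text)  # always contains at least one element
    # Skip the first piece if a leading delimiter made it empty.
    start = 1 if res[0] == "" else 0
    # Stop one short if a trailing delimiter made the last piece empty.
    stop = len(res) - 1 if res[-1] == "" else len(res)
    return res[start:stop]


print(split_trimmed("a,,b,c,"))  # ['a', 'b', 'c']
print(split_trimmed(",a,b,"))    # ['a', 'b']
print(split_trimmed(","))        # []
```

Only the two end elements are inspected, so nothing in the middle of the list is touched.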
Upvotes: 4