Reputation: 7674
In Python, I am executing this:
>>> re.split("(hello|world|-)", 'hello-world')
I expected this:
['hello', '-', 'world']
However, I am getting this:
['', 'hello', '', '-', '', 'world', '']
Where is this '' coming from?
I am using Python 3, in case it matters.
Many of you are saying I could split it on -; however, I want to extract tokens, if that makes sense. For example, if I had "hellohello---worldhello", I want it to return
['hello', 'hello', '-', '-', '-', 'world', 'hello']
Upvotes: 0
Views: 51
Reputation: 70732
According to the documentation:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:
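A small sketch of that behavior, using the pattern and input from the question (the variable names are only illustrative). Because every character of the input is consumed by a separator match, all the pieces between separators come out empty:

```python
import re

pattern = "(hello|world|-)"

# The whole input is made of separator matches, so nothing is left
# before, between, or after them except empty strings.
parts = re.split(pattern, "hello-world")
print(parts)

# With one capturing group, odd indexes hold the captured separators
# and even indexes hold the text found between them.
separators = parts[1::2]
between = parts[0::2]
print(separators)
print(between)
```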
You could always use filter to drop the empty strings if this is your concern. In Python 3, filter returns an iterator, so wrap it in list():
>>> list(filter(None, re.split('(hello|world|-)', 'hellohello---worldhello')))
['hello', 'hello', '-', '-', '-', 'world', 'hello']
Or use findall to grab your matches:
>>> re.findall('(hello|world|-)', 'hellohello---worldhello')
['hello', 'hello', '-', '-', '-', 'world', 'hello']
Upvotes: 2
Reputation: 176
The extra elements appear because you are asking re to split the string on, for example, hello, so it reports what comes before hello, what lies between hello and '-', and so on. All of these are empty strings.
If you change it to:
re.split("(-)", 'hello-world')
you will get the desired result:
['hello', '-', 'world']
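Note that this only gives the desired result for 'hello-world'. As a quick check against the second example from the question (variable names are illustrative), splitting on "(-)" alone leaves the repeated words joined and still produces empty strings between adjacent dashes:

```python
import re

# Works for the simple input: the words are separated by a single dash.
simple = re.split("(-)", "hello-world")
print(simple)

# With repeated words and adjacent dashes, the words are not separated
# and the gaps between consecutive dashes show up as empty strings.
repeated = re.split("(-)", "hellohello---worldhello")
print(repeated)
```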
Upvotes: 0