Splitting using regex is giving unwanted empty strings

Question

In Python, I am executing this:

>>> re.split("(hello|world|-)", 'hello-world')

I am expecting this:
['hello', '-', 'world']

However, I am getting this:
['', 'hello', '', '-', '', 'world', '']

Where are these '' coming from?

I am using Python 3, in case it matters.

Edit

Many of you are saying I could split on "-". However, I want to extract tokens, not just split on one delimiter. For example, given "hellohello---worldhello", I want it to return:

['hello', 'hello', '-', '-', '-', 'world', 'hello']

hwnd · Accepted Answer

According to the documentation:

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string.
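The empty strings in the middle have the same cause: re.split returns the text between separators, and here two separators sit directly next to each other with nothing in between. A minimal sketch that makes this visible by running the question's pattern with and without the capturing group (the variable names are only for illustration):

```python
import re

text = 'hello-world'

# Without a capturing group, only the text *between* separators is returned.
# Every character here belongs to some separator, so all four pieces are empty.
pieces = re.split('hello|world|-', text)
print(pieces)  # ['', '', '', '']

# With a capturing group, each matched separator is inserted between
# those same empty pieces.
with_separators = re.split('(hello|world|-)', text)
print(with_separators)  # ['', 'hello', '', '-', '', 'world', '']
```

So each '' is a zero-length gap: before 'hello', between 'hello' and '-', between '-' and 'world', and after 'world'.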

You can use filter to drop the empty strings. In Python 3, filter returns an iterator, so wrap it in list() to get a list back.

>>> list(filter(None, re.split('(hello|world|-)', 'hellohello---worldhello')))
['hello', 'hello', '-', '-', '-', 'world', 'hello']

Or use findall to grab your matches.

>>> re.findall('(hello|world|-)', 'hellohello---worldhello')
['hello', 'hello', '-', '-', '-', 'world', 'hello']
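Since the goal is to extract tokens, the findall approach could be wrapped in a small helper. This is only a sketch: TOKEN and tokenize are hypothetical names, not something from the question. Be aware that findall silently skips any text the pattern does not match.

```python
import re

# Hypothetical token pattern, taken from the alternatives in the question.
TOKEN = re.compile('hello|world|-')

def tokenize(text):
    """Return every token found in text, in order of appearance."""
    return TOKEN.findall(text)

print(tokenize('hellohello---worldhello'))
# ['hello', 'hello', '-', '-', '-', 'world', 'hello']

# Characters that match no alternative are dropped without any error.
print(tokenize('hello?world'))
# ['hello', 'world']
```

If unmatched input should be reported rather than ignored, the split-and-filter version may be the better fit, because the unmatched text shows up in its result as a non-empty piece.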
