Krimson

Reputation: 7674

Splitting using regex is giving unwanted empty strings

In Python, I am executing this:

>>> re.split("(hello|world|-)", 'hello-world')


I am expecting this:
['hello', '-', 'world']

However, I am getting this:
['', 'hello', '', '-', '', 'world', '']

Where are the '' entries coming from?

I am using Python 3, in case it matters.


Edit

Many of you are saying I could split on -. However, I want to extract tokens, if that makes sense. For example, if I had "hellohello---worldhello", I would want it to return:

['hello', 'hello', '-', '-', '-', 'world', 'hello']

Upvotes: 0

Views: 51

Answers (2)

hwnd

Reputation: 70732

According to the documentation:

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:

You could always use filter to drop the empty strings from the list if this is your concern. In Python 3, filter returns an iterator, so wrap it in list:

>>> list(filter(None, re.split('(hello|world|-)', 'hellohello---worldhello')))
['hello', 'hello', '-', '-', '-', 'world', 'hello']

Or use findall to grab your matches.

>>> re.findall('(hello|world|-)', 'hellohello---worldhello')
['hello', 'hello', '-', '-', '-', 'world', 'hello']
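
One caveat with findall for tokenizing: it silently skips any text the pattern does not match. If that matters, a rough sketch of a stricter tokenizer could look like the following. The names TOKEN and tokenize are made up for illustration, and raising ValueError on unrecognized input is an assumption about what you would want.

```python
import re

# Hypothetical token pattern; same alternatives as in the question.
TOKEN = re.compile(r"hello|world|-")

def tokenize(text):
    """Return the tokens in text, raising ValueError on unrecognized input."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Pattern.match anchors the match at pos, so nothing is skipped.
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected text at position {pos}: {text[pos:]!r}")
        tokens.append(match.group())
        pos = match.end()
    return tokens

print(tokenize("hellohello---worldhello"))
# ['hello', 'hello', '-', '-', '-', 'world', 'hello']
```

For input made up only of the known tokens, this gives the same list as findall; the difference only shows up when the string contains something else.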

Upvotes: 2

Ethan Gutmann

Reputation: 176

The extra output elements appear because you are asking re to split the string on, e.g., hello, so it reports what comes before hello, what lies between hello and '-', and so on. All of those pieces are empty strings.

If you change it to:

re.split("(-)", 'hello-world')

you will get the desired result:

['hello', '-', 'world']
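
To make the explanation above concrete, here is a small sketch that pulls the original result apart. With a single capturing group, the even positions of the list hold the text between separators and the odd positions hold the captured separators; the variable names are just for illustration.

```python
import re

parts = re.split("(hello|world|-)", "hello-world")

# Even indices: text found before, between, and after the separators.
between = parts[0::2]
# Odd indices: the separators themselves, kept because of the capturing group.
separators = parts[1::2]

print(parts)       # ['', 'hello', '', '-', '', 'world', '']
print(between)     # ['', '', '', '']
print(separators)  # ['hello', '-', 'world']
```

Since every character of the input belongs to some separator, there is nothing left between them, which is why all the in-between pieces are empty.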

Upvotes: 0
