Reputation: 2664

Python RE re.split(), results start with empty string

I have some questions on the split() description/examples from the Python RE documents

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:

re.split(r'(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']

In this example there is a capturing group, it matches at the start and end of the string, thus, the result starts and ends with an empty string. Outside of understanding that this happens, I would like to better understand the reasoning. The explanation for this is:

That way, separator components are always found at the same relative indices within the result list.

Could someone expand on this? Relative to what?

My other query is related to this example:

re.split(r'(\W*)', '...words...')
['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']

\w will match any character that can be used in any word in any language (Flag:unicode), or will be the equivalent of [a-zA-Z0-9_] (Flag:ASCII), \W is the inverse of this. Can someone talk about each of the matches in the example above, explain each (if possible) in terms of what is matched (\B, \U, ...).

Added 29/01/2019:

Apart of what I am after wasn't stated very clear (my bad). In terms of the second example, I am curious about the steps taken to come to the result (how the python re module processed the example). After reading this post on Zero-Length Regex Matches things are clearer, but I would still be interest if anyone can break down the logic up to ['', '...', '', '', 'w', in the results.

Upvotes: 0

Answers (2)

Tyl

Reputation: 5252

Check this:

>>> re.split('(a)', 'awords')
['', 'a', 'words']
>>> re.split('(w)', 'awords')
['a', 'w', 'ords']
>>> re.split('(o)', 'awords')
['aw', 'o', 'rds']
>>> re.split('(s)', 'awords')
['aword', 's', '']

Always at the second place (index of 1).

On the other hand:

>>> re.split('a', 'awords')
['', 'words']
>>> re.split('w', 'awords')
['a', 'ords']
>>> re.split('s', 'awords')
['aword', '']

Almost the same, only the catching group not inside.

Upvotes: 1

Barmar

Reputation: 781741

What it's trying to say is that when you have a capturing group in the delimiter, and it matches the beginning of the string, the resulting list will always start with the delimiter. Similarly, if it matches at the end of the string, the list will always end with the delimiter.

For consistency, this is true even when the delimiter matches an empty string. The input string is considered to have an empty string before the first character and after the last character, and the delimiter will match these. And then they'll be the first and last elements of the resulting list.

Upvotes: 2

Python RE re.split(), results start with empty string

Answers (2)

Related Questions