Reputation: 31840
I've been trying to split a string using a regular expression as a separator, but the output of re.split appears to contain some redundant results.
import re
replaceArray = '((Replace the string)|((in|inside|within) the string)|(with the string))'
stringToSplit = '(Replace the string arr1 in the array arr2 with the array arr3)'
print(re.split(replaceArray, stringToSplit))
I expected the split string to look like this, without any overlapping results:
['Replace the string', ' arr1 ', 'in the string', ' arr2 ', 'with the string', ' arr3']
But instead, the array of split strings contained some redundant results, which appear to overlap with the other matched strings:
['', 'Replace the string', 'Replace the string', None, None, None, ' arr1 ', 'in the string', None, 'in the string', 'in', None, ' arr2 ', 'with the string', None, None, None, 'with the string', ' arr3']
Is there any way to prevent these redundant and overlapping results from being included in the output of re.split?
Upvotes: 0
Views: 222
Reputation: 388163
Groups that begin with ?: are non-capturing groups and will not appear in the output. Furthermore, you probably don’t want to use re.split here but re.match instead. You’re not really interested in splitting the string; you want to extract those groups out of it.
>>> expr = r'\((Replace the array (.*?)) ((?:in|inside|within) the array (.*?)) (with the array (.*?))\)'
>>> re.match(expr, stringToSplit).groups()
('Replace the array arr1', 'arr1', 'in the array arr2', 'arr2', 'with the array arr3', 'arr3')
Or, capturing the keywords and the names separately:
>>> expr = r'\((Replace the array) (.*?) ((?:in|inside|within) the array) (.*?) (with the array) (.*?)\)'
>>> re.match(expr, stringToSplit).groups()
('Replace the array', 'arr1', 'in the array', 'arr2', 'with the array', 'arr3')
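As a self-contained sketch of the second pattern: the input string below is an assumption (it uses the "the array" wording throughout so that the pattern can match), and the None check is added because re.match returns None when the pattern does not match.

```python
import re

# Assumed input: every clause uses "the array", as the pattern requires.
text = '(Replace the array arr1 in the array arr2 with the array arr3)'

# Raw string so that \( and \) reach the regex engine unchanged.
expr = r'\((Replace the array) (.*?) ((?:in|inside|within) the array) (.*?) (with the array) (.*?)\)'

match = re.match(expr, text)
# re.match returns None on failure, so guard before calling .groups().
fields = match.groups() if match is not None else ()
print(fields)
# ('Replace the array', 'arr1', 'in the array', 'arr2', 'with the array', 'arr3')
```

If the wording in the input differs from the pattern (for example "the string" instead of "the array"), fields is simply an empty tuple here rather than an AttributeError.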
Upvotes: 1
Reputation: 11144
From the docs on re.split:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
I think you want to use non-capturing groups in your regex.
That is, instead of using (...), use (?:...).
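A minimal sketch of the difference, using a made-up input string and a digit as the separator (both are illustrative assumptions, not taken from the question):

```python
import re

text = "a1b2c"  # hypothetical input for illustration

# Capturing group: the matched separators are included in the result.
with_capture = re.split(r"(\d)", text)

# Non-capturing group: only the pieces between the separators are returned.
without_capture = re.split(r"(?:\d)", text)

print(with_capture)     # ['a', '1', 'b', '2', 'c']
print(without_capture)  # ['a', 'b', 'c']
```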
Upvotes: 1
Reputation: 208625
If you have capturing groups in your regex, the results of re.split() will include those capturing groups. Add ?: to the beginning of all of your groups to make them non-capturing. Several of those groups are not actually necessary; try the following:
replaceArray = 'Replace the string|(?:in|inside|within) the string|with the string'
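A sketch of how this pattern behaves, assuming an input that uses "the string" in every clause (the sentence below is an assumption chosen so that all three alternatives can match). With no capturing groups the separators are dropped entirely; if the separators should be kept once each, as in the question's expected output, one option is to wrap the whole pattern in a single capturing group and filter out empty strings.

```python
import re

# Assumed input: "the string" in every clause so each alternative can match.
sentence = 'Replace the string arr1 in the string arr2 with the string arr3'

# No capturing groups: the separators do not appear in the result.
replaceArray = 'Replace the string|(?:in|inside|within) the string|with the string'
without_separators = re.split(replaceArray, sentence)
print(without_separators)
# ['', ' arr1 ', ' arr2 ', ' arr3']

# One outer capturing group: each separator is returned exactly once.
keep_separators = '(' + replaceArray + ')'
# Drop the empty string produced by the match at the very start.
parts = [p for p in re.split(keep_separators, sentence) if p]
print(parts)
# ['Replace the string', ' arr1 ', 'in the string', ' arr2 ', 'with the string', ' arr3']
```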
Upvotes: 2