Anderson Green

Reputation: 31840

How to split a string in Python without redundant output

I've been trying to split a string using a regular expression as a separator, but the output of re.split appears to contain some redundant results.

import re
replaceArray = '((Replace the string)|((in|inside|within) the string)|(with the string))'
stringToSplit = '(Replace the string arr1 in the array arr2 with the array arr3)'
print(re.split(replaceArray, stringToSplit))

I expected the split string to look like this, without any overlapping results:

['Replace the string', ' arr1 ', 'in the string', ' arr2 ', 'with the string', ' arr3']

But instead, the array of split strings contained some redundant results, which appear to overlap with the other matched strings:

['', 'Replace the string', 'Replace the string', None, None, None, ' arr1 ', 'in the string', None, 'in the string', 'in', None, ' arr2 ', 'with the string', None, None, None, 'with the string', ' arr3']

Is there any way to prevent these redundant and overlapping results from being included in the output of re.split?

Upvotes: 0

Views: 222

Answers (3)

poke

Reputation: 388163

Groups that start with ?: are non-capturing groups and will not appear in the output. Furthermore, you probably don't want to use re.split here but re.match instead: you're not really interested in splitting the string; you want to extract those groups from it.

>>> expr = r'\((Replace the array (.*?)) ((?:in|inside|within) the array (.*?)) (with the array (.*?))\)'
>>> re.match(expr, stringToSplit).groups()
('Replace the array arr1', 'arr1', 'in the array arr2', 'arr2', 'with the array arr3', 'arr3')

Or

>>> expr = r'\((Replace the array) (.*?) ((?:in|inside|within) the array) (.*?) (with the array) (.*?)\)'
>>> re.match(expr, stringToSplit).groups()
('Replace the array', 'arr1', 'in the array', 'arr2', 'with the array', 'arr3')
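One thing to watch with this approach: re.match returns None when the pattern does not match, so calling .groups() directly raises an AttributeError. A minimal sketch of a guarded version follows; the helper name extract_parts and the sample input (which assumes "the array" wording throughout) are illustrative, not from the question.

```python
import re

# Same pattern as the second variant above, written as a raw string.
EXPR = r'\((Replace the array) (.*?) ((?:in|inside|within) the array) (.*?) (with the array) (.*?)\)'

def extract_parts(text):
    """Return the captured groups as a tuple, or None if the text does not match."""
    m = re.match(EXPR, text)
    return m.groups() if m else None

# Assumed input: uses "the array" everywhere so that it fits the pattern.
sample = '(Replace the array arr1 in the array arr2 with the array arr3)'
print(extract_parts(sample))
print(extract_parts('something else entirely'))  # None
```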

Upvotes: 1

jwd

Reputation: 11144

From the docs on re.split:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

I think you want to use non-capturing groups in your regex. That is, instead of using (...), use (?:...).
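To see the difference on a small case, here is a quick sketch using a made-up input string (text is just an illustrative name):

```python
import re

text = 'a1b2c'

# Capturing group: the matched separators are included in the result.
with_capture = re.split(r'(\d)', text)

# Non-capturing group: only the pieces between separators are returned.
without_capture = re.split(r'(?:\d)', text)

print(with_capture)     # ['a', '1', 'b', '2', 'c']
print(without_capture)  # ['a', 'b', 'c']
```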

Upvotes: 1

Andrew Clark

Reputation: 208625

If you have capturing groups in your regex, the results of re.split() will include the text matched by those groups. Add ?: to the beginning of each group to make it non-capturing. Several of those groups are not actually necessary; try the following:

replaceArray = 'Replace the string|(?:in|inside|within) the string|with the string'
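As a rough sketch of how this behaves, the snippet below runs the pattern against an assumed input that uses "the string" wording throughout (the sample in the question mixes "string" and "array", so it would only match once). If you also want each separator to appear exactly once in the result, as in the expected output in the question, one option is to wrap the whole alternation in a single capturing group and drop empty strings. The variable names are illustrative.

```python
import re

pattern = r'Replace the string|(?:in|inside|within) the string|with the string'

# Assumed input: wording adjusted so that every separator matches the pattern.
text = 'Replace the string arr1 in the string arr2 with the string arr3'

# No capturing groups: only the text between separators is returned.
parts = re.split(pattern, text)
print(parts)  # ['', ' arr1 ', ' arr2 ', ' arr3']

# One outer capturing group: each separator appears exactly once.
# Empty strings (here, the one before the leading separator) are filtered out.
with_separators = [p for p in re.split('(' + pattern + ')', text) if p]
print(with_separators)
# ['Replace the string', ' arr1 ', 'in the string', ' arr2 ', 'with the string', ' arr3']
```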

Upvotes: 2
