Exclude matched string python re.findall

Question

I am using Python's re.findall to find occurrences of certain substrings in an input string, e.g. 'ABCdef'. I have two search requirements:

1. Find strings that start with a single capital letter.
2. After 1, find strings that consist entirely of capital letters.

For example, the input strings and expected outputs are:

'USA' -- output: ['USA']
'BObama' -- output: ['B', 'Obama']
'Institute20CSE' -- output: ['Institute', '20', 'CSE']

So my expectation is that

>>> matched_value_list = re.findall('[A-Z][a-z]+|[A-Z]+', 'ABCdef')

returns ['AB', 'Cdef'].

But that does not seem to happen. What I get is ['ABC'], because the second part of the regex matches the whole run of capital letters.

Is there any way to exclude text that has already been matched, so that once 'Cdef' is matched by '[A-Z][a-z]+', the second part of the regex (i.e. '[A-Z]+') only matches the remaining string 'AB'?

Rohit Jain · Accepted Answer

First, you need to match AB: a run of uppercase letters that is either followed by an uppercase letter and then a lowercase letter, or sits at the end of the string. You can use a look-ahead for that.

Then you need to match an uppercase letter (C) followed by one or more lowercase letters (def).

So, you can use this pattern:

>>> import re
>>> s = "ABCdef"
>>> re.findall("([A-Z]+(?=[A-Z][a-z]|$)|[A-Z][a-z]+)", s)
['AB', 'Cdef']

>>> re.findall("([A-Z]+(?=[A-Z][a-z]|$)|[A-Z][a-z]+)", 'MumABXYZCdefXYZAbc')
['Mum', 'ABXYZ', 'Cdef', 'XYZ', 'Abc']
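
To see why the look-ahead leaves the final capital for the next match, it can help to print the span of each match. This is only a minimal sketch: it uses re.finditer instead of re.findall, drops the outer parentheses (group() already returns the whole match), and the variable names are illustrative.

```python
import re

# Same pattern as above, without the outer capturing group.
pattern = re.compile(r"[A-Z]+(?=[A-Z][a-z]|$)|[A-Z][a-z]+")

# The look-ahead is zero-width: it checks what follows but does not consume it,
# so 'C' is still available for the second alternative on the next match.
spans = [(m.group(), m.span()) for m in pattern.finditer("ABCdef")]
print(spans)  # [('AB', (0, 2)), ('Cdef', (2, 6))]
```

The first match ends at index 2, and the second one starts there, so nothing is skipped or matched twice.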

As @sotapme pointed out in a comment, you can also modify the above regex to:

r"([A-Z]+(?=[A-Z]|$)|[A-Z][a-z]+|\d+)"

I added \d+ because you also want to match digits, as in one of your examples. He also removed the [a-z] part from the look-ahead. That works because the + quantifier on the outer [A-Z] is greedy by default: it first matches the longest possible run of uppercase letters, then backtracks one letter at a time until the look-ahead succeeds. So it stops just before the last uppercase letter of the run, or takes the whole run when it reaches the end of the string.
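
As a quick check, the modified pattern can be run against the examples from the question. This is a rough sketch, not a definitive implementation; split_tokens is a hypothetical helper name introduced here, and the outer parentheses are omitted because findall returns the whole match when there is no group.

```python
import re

# Modified pattern from the comment, as a raw string so that \d is passed
# to the regex engine unchanged.
PATTERN = re.compile(r"[A-Z]+(?=[A-Z]|$)|[A-Z][a-z]+|\d+")

def split_tokens(text):
    """Hypothetical helper: return all non-overlapping matches, left to right."""
    return PATTERN.findall(text)

for s in ["USA", "BObama", "Institute20CSE", "ABCdef"]:
    print(s, split_tokens(s))
# USA ['USA']
# BObama ['B', 'Obama']
# Institute20CSE ['Institute', '20', 'CSE']
# ABCdef ['AB', 'Cdef']
```

One case the question's examples do not cover is a capital run directly followed by a digit: with this pattern, 'CSE20' gives ['CS', '20'], because the look-ahead only accepts another capital or the end of the string.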
