lalit

Reputation: 3433

Exclude matched string python re.findall

I am using Python's re.findall method to find occurrences of certain substrings in an input string. For example, when searching the string 'ABCdef', I have two requirements:

  1. Find a string that starts with a single capital letter followed by lowercase letters.
  2. After 1, find a string that consists entirely of capital letters.

e.g. input string and expected output will be:

So my expectation from

>>> import re
>>> matched_value_list = re.findall('[A-Z][a-z]+|[A-Z]+', 'ABCdef')

is to return ['AB', 'Cdef'].

But that does not seem to be happening. What I get is ['ABC'] as the return value: the latter part of the regex matches the whole run of capital letters.
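One way to see why this happens is to look at where each match starts and ends. The sketch below runs re.finditer on the same pattern and input; the variable names are only illustrative.

```python
import re

pattern = '[A-Z][a-z]+|[A-Z]+'
text = 'ABCdef'

# At index 0 the first alternative fails ('A' is not followed by a lowercase
# letter), so the second alternative consumes the whole uppercase run.
spans = [(m.group(), m.span()) for m in re.finditer(pattern, text)]
print(spans)  # [('ABC', (0, 3))]

# The scan then resumes at index 3, where 'def' has no capital letter
# to start a match, so nothing else is found.
```

The regex engine scans left to right and never revisits text it has already consumed, so 'C' is no longer available when the scan reaches 'def'.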

So is there any way to exclude text that has already been matched, so that once 'Cdef' is matched by '[A-Z][a-z]+', the second part of the regex (i.e. '[A-Z]+') only matches the remaining string 'AB'?

Upvotes: 2

Views: 4034

Answers (2)

Rohit Jain

Reputation: 213223

First you need to match AB: a run of uppercase letters that is followed either by an uppercase letter and then a lowercase letter, or by the end of the string. For that you can use a look-ahead.

Then you need to match an uppercase letter, C, followed by one or more lowercase letters, def.

So, you can use this pattern:

>>> import re
>>> s = "ABCdef"
>>> re.findall("([A-Z]+(?=[A-Z][a-z]|$)|[A-Z][a-z]+)", s)
['AB', 'Cdef']

>>> re.findall("([A-Z]+(?=[A-Z][a-z]|$)|[A-Z][a-z]+)", 'MumABXYZCdefXYZAbc')
['Mum', 'ABXYZ', 'Cdef', 'XYZ', 'Abc']

As pointed out in a comment by @sotapme, you can also modify the above regex to:

"([A-Z]+(?=[A-Z]|$)|[A-Z][a-z]+|\d+)"

I added \d+ since you also want to match digits, as in one of your examples. He also removed the [a-z] from the first alternative of the look-ahead. That works because the + quantifier on the outer [A-Z] is greedy by default: it first matches the longest possible run of uppercase letters and backtracks only as far as needed, so when lowercase letters follow, it stops just before the last uppercase letter.
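A minimal sketch of the simplified pattern in use. The helper name split_caps and the input 'ABCdef123' are made up here for illustration and are not from the original question.

```python
import re

# Simplified pattern from the comment, written as a raw string so that \d
# reaches the regex engine unchanged. The outer group is dropped here;
# findall then returns the whole match, which is the same text.
CAPS_PATTERN = re.compile(r'[A-Z]+(?=[A-Z]|$)|[A-Z][a-z]+|\d+')

def split_caps(text):
    """Hypothetical helper: return the pieces matched by CAPS_PATTERN."""
    return CAPS_PATTERN.findall(text)

print(split_caps('ABCdef'))              # ['AB', 'Cdef']
print(split_caps('MumABXYZCdefXYZAbc'))  # ['Mum', 'ABXYZ', 'Cdef', 'XYZ', 'Abc']
print(split_caps('ABCdef123'))           # ['AB', 'Cdef', '123']
```

On 'ABCdef' the first alternative initially grabs 'ABC', fails the look-ahead at 'd', gives back 'C', and succeeds with 'AB'; the second alternative then picks up 'Cdef'.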

Upvotes: 5

Anirudha

Reputation: 32787

You can use this regex:

[A-Z][a-zA-Z]*?(?=[A-Z][a-z]|[^a-zA-Z]|$)
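As a quick check, here is a sketch that runs this regex through re.findall on the inputs used elsewhere in this thread. The name LAZY_PATTERN is only illustrative.

```python
import re

# Lazy match starting at a capital letter, extended one letter at a time
# until the look-ahead sees "uppercase then lowercase", a non-letter,
# or the end of the string.
LAZY_PATTERN = re.compile(r'[A-Z][a-zA-Z]*?(?=[A-Z][a-z]|[^a-zA-Z]|$)')

print(LAZY_PATTERN.findall('ABCdef'))
# ['AB', 'Cdef']
print(LAZY_PATTERN.findall('MumABXYZCdefXYZAbc'))
# ['MumABXYZ', 'CdefXYZ', 'Abc']
```

Note that this pattern only splits in front of an uppercase letter that is followed by a lowercase one, so on the longer input it keeps 'Mum' and 'ABXYZ' together. If the lowercase-to-uppercase boundary should also be a split point, the pattern would need to be adjusted.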

Upvotes: 1
