Reputation: 3433
I am using Python's re.findall method to find occurrences of certain substrings in an input string, for example when searching the string 'ABCdef'. I have two search requirements.
For example, the input strings and expected outputs are:
'USA' -- output: ['USA']
'BObama' -- output: ['B', 'Obama']
'Institute20CSE' -- output: ['Institute', '20', 'CSE']
So my expectation is that
>>> matched_value_list = re.findall('[A-Z][a-z]+|[A-Z]+', 'ABCdef')
returns ['AB', 'Cdef'].
But that does not seem to be happening. What I get is ['ABC'] as the return value, which comes from the later part of the regex being matched against the full string.
So is there any way to ignore matches that have already been found, so that once 'Cdef' is matched by '[A-Z][a-z]+', the second part of the regex (i.e. '[A-Z]+') only matches the remaining string 'AB'?
Upvotes: 2
Views: 4034
Reputation: 213223
First you need to match AB, which is followed by an uppercase letter and then a lowercase letter, or is at the end of the string. For that you can use a look-ahead.
Then you need to match an uppercase letter, C, followed by one or more lowercase letters, def.
So, you can use this pattern:
>>> import re
>>> s = "ABCdef"
>>> re.findall("([A-Z]+(?=[A-Z][a-z]|$)|[A-Z][a-z]+)", s)
['AB', 'Cdef']
>>> re.findall("([A-Z]+(?=[A-Z][a-z]|$)|[A-Z][a-z]+)", 'MumABXYZCdefXYZAbc')
['Mum', 'ABXYZ', 'Cdef', 'XYZ', 'Abc']
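To check this pattern against the inputs from the question, a quick sketch along these lines may help. The names PATTERN and split_caps are only illustrative and do not come from the question. Note that this version has no alternative for digits, so the '20' in 'Institute20CSE' is simply skipped.

```python
import re

# Pattern from the answer above: an uppercase run that is followed by a
# capitalized word (or by the end of the string), or a capitalized word.
PATTERN = re.compile(r"([A-Z]+(?=[A-Z][a-z]|$)|[A-Z][a-z]+)")


def split_caps(text):
    """Return all non-overlapping matches of PATTERN (hypothetical helper)."""
    return PATTERN.findall(text)


for sample in ["USA", "BObama", "ABCdef", "Institute20CSE"]:
    # The digits in the last sample are not matched by this pattern.
    print(sample, "->", split_caps(sample))
```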
As pointed out in a comment by @sotapme, you can also modify the above regex to:
"([A-Z]+(?=[A-Z]|$)|[A-Z][a-z]+|\d+)"
This adds \d+, since you also want to match digits, as in one of your examples. It also removes the [a-z] part from the first alternative of the look-ahead. That works because the + quantifier on the outer [A-Z] is greedy by default, so it matches as many uppercase letters as it can and backtracks only far enough to stop just before the last uppercase letter.
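As a rough check, the modified pattern can be run over the examples from the question. This is only a sketch: PATTERN_WITH_DIGITS and tokenize are made-up names, and the pattern is written as a raw string so that \d is not treated as a string escape.

```python
import re

# Modified pattern: uppercase run followed by another uppercase letter or
# the end of the string, or a capitalized word, or a run of digits.
PATTERN_WITH_DIGITS = re.compile(r"([A-Z]+(?=[A-Z]|$)|[A-Z][a-z]+|\d+)")


def tokenize(text):
    """Return all matches of PATTERN_WITH_DIGITS (hypothetical helper)."""
    return PATTERN_WITH_DIGITS.findall(text)


for sample in ["USA", "BObama", "Institute20CSE", "ABCdef"]:
    print(sample, "->", tokenize(sample))

# Edge case: an uppercase run directly followed by digits loses its last
# letter, because the look-ahead only accepts an uppercase letter or the end.
print(tokenize("CSE20"))
```

If uppercase runs can be followed directly by digits in your data, the look-ahead would probably need to allow for that case as well.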
Upvotes: 5
Reputation: 32787
You can use this regex:
[A-Z][a-zA-Z]*?(?=[A-Z][a-z]|[^a-zA-Z]|$)
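A small sketch of how this pattern behaves on the inputs from the question, using illustrative names (LAZY_PATTERN, split_lazy) that are not part of the answer. Since the pattern has no capturing group, re.findall returns the full matches.

```python
import re

# Lazy variant: start at an uppercase letter and extend as little as possible
# until a capitalized word, a non-letter, or the end of the string follows.
LAZY_PATTERN = re.compile(r"[A-Z][a-zA-Z]*?(?=[A-Z][a-z]|[^a-zA-Z]|$)")


def split_lazy(text):
    """Return all matches of LAZY_PATTERN (hypothetical helper)."""
    return LAZY_PATTERN.findall(text)


for sample in ["USA", "BObama", "ABCdef", "Institute20CSE"]:
    # Digits are used only as a boundary here; they are not returned.
    print(sample, "->", split_lazy(sample))

# A lowercase letter followed by an uppercase run is not treated as a
# boundary, so mixed input is split differently than by the look-ahead
# patterns in the other answer.
print(split_lazy("MumABXYZCdefXYZAbc"))
```

On the question's examples this gives the expected splits except that the '20' in 'Institute20CSE' is dropped, so it seems best suited to input where only the letter groups matter.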
Upvotes: 1