Reputation: 3433
I am looking for expressions such as Vc or Am in texts, and for that I have:
rex = r"(\(?)(?<!([A-Za-z0-9]))[A-Z][a-z](?!([A-Za-z0-9]))(\)?)"
Explanation:
[A-Z][a-z] -> capital letter followed by a lowercase letter
(?<!([A-Za-z0-9])) -> lookbehind: not preceded by a letter or number
(?!([A-Za-z0-9])) -> lookahead: not followed by a letter or number
(\(?) and (\)?) -> all of that optionally within parentheses
import re
text = "this is Vc and not Cr nor Pb"
matches = re.finditer(rex, text)  # iterator over match objects
What I want to achieve is to exclude a list of terms like Cr or Pb.
How should I include exceptions in the expression?
thanks
Upvotes: 1
Views: 57
Reputation: 11181
First, let's shorten your RegEx:
(?<!([A-Za-z0-9])) -> lookbehind not being a letter or number
(?!([A-Za-z0-9])) -> lookahead not being a letter or number
These are so common that there is a RegEx feature for them: word boundaries (\b). They have zero width like lookarounds and match only where a word character (letter, digit or underscore) meets a non-word character or the start or end of the text.
Leaving the optional parentheses aside, your RegEx then becomes \b[A-Z][a-z]\b; looking at this RegEx (and your examples), it appears you want to match certain element abbreviations?
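As a quick check of the shortened pattern, here is a minimal Python sketch. The variable names are only illustrative; the sample text is the one from the question. It also shows the one behavioural difference from the original lookarounds: `\b` counts an underscore as a word character.

```python
import re

# Word-boundary version of the original pattern (optional parentheses left out).
WORD_BOUNDARY_REX = re.compile(r"\b[A-Z][a-z]\b")

text = "this is Vc and not Cr nor Pb"
found = WORD_BOUNDARY_REX.findall(text)
print(found)  # ['Vc', 'Cr', 'Pb']

# Unlike (?<![A-Za-z0-9]), \b treats "_" as a word character,
# so there is no boundary between "_" and "V" here.
after_underscore = WORD_BOUNDARY_REX.findall("_Vc")
print(after_underscore)  # []
```

At this stage Cr and Pb are still matched; excluding them is the next step.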
Now you can simply add a negative lookbehind to assert that the matched element is neither chromium (Cr) nor lead (Pb).
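The exact pattern is not shown here, so the following is only one way such a lookbehind could be written, as a sketch in Python. It relies on the fact that Python's `re` module accepts alternatives inside a lookbehind as long as they all have the same width, which holds for `Cr` and `Pb`.

```python
import re

# Assumed form: after matching the two letters, look back and reject Cr / Pb.
# Both alternatives are two characters wide, as Python's re requires.
NO_CR_PB_REX = re.compile(r"\b[A-Z][a-z](?<!Cr|Pb)\b")

text = "this is Vc and not Cr nor Pb"
kept = NO_CR_PB_REX.findall(text)
print(kept)  # ['Vc']
```

If the excluded terms had different lengths, this approach would not compile in Python's `re`, and a lookahead placed before the letters would be the safer choice.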
Just for fun:
Alternatively, if you want a less readable (but more portable) RegEx that makes do with fewer advanced RegEx features (not every engine supports lookaround), you can use character sets based on the following observations:
If the first letter is neither C nor P, the second letter may be any lowercase letter;
if the first letter is C, the second letter may not be an r;
if the first letter is P, the second letter may not be a b.
Using character sets, this gives us:
[ABD-OQ-Z][a-z]
C[a-qs-z]
P[ac-z]
Operator precedence works as expected here: concatenation (implicit) has higher precedence than alternation (|). This makes the RegEx [ABD-OQ-Z][a-z]|C[a-qs-z]|P[ac-z]. Wrapping this in word boundaries using a group gives us \b([ABD-OQ-Z][a-z]|C[a-qs-z]|P[ac-z])\b.
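To see the character-set variant in action, here is a small Python sketch. One deliberate change, purely for convenience: the group is made non-capturing (`(?:...)`) so that `findall` returns the whole match. The sample text is made up for illustration.

```python
import re

# Lookaround-free variant: the excluded second letters are carved out
# of the character sets for C and P.
CHARSET_REX = re.compile(r"\b(?:[ABD-OQ-Z][a-z]|C[a-qs-z]|P[ac-z])\b")

sample = "Vc Cr Pb Ca Pt Cu"
symbols = CHARSET_REX.findall(sample)
print(symbols)  # ['Vc', 'Ca', 'Pt', 'Cu']
```

This works for a short exclusion list, but every additional excluded term means editing the character sets by hand.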
Upvotes: 2
Reputation: 163362
You might write the pattern without the superfluous capture groups, and exclude matching Cr or Pb:
\(?(?<![A-Za-z0-9])(?!Cr\b|Pb\b)[A-Z][a-z](?![A-Za-z0-9])\)?
See a regex demo for the matches.
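A short Python sketch of that pattern, including the optional parentheses. The sample text is invented for illustration and is not taken from the demo.

```python
import re

# Pattern from above: optional parentheses, no capture groups,
# and a negative lookahead that rejects Cr and Pb.
PAREN_REX = re.compile(
    r"\(?(?<![A-Za-z0-9])(?!Cr\b|Pb\b)[A-Z][a-z](?![A-Za-z0-9])\)?"
)

text = "(Vc) and (Cr) with Am but not Pb"
matches = [m.group() for m in PAREN_REX.finditer(text)]
print(matches)  # ['(Vc)', 'Am']
```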
If you are not interested in matching the parentheses, and you also do not want to allow an underscore next to the letters or numbers, you can use word boundaries instead:
\b(?!Cr\b|Pb\b)[A-Z][a-z]\b
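Since the question asks about a list of terms, the word-boundary pattern can also be built from a list. The helper below is a hypothetical sketch (the name `build_symbol_regex` is not from any library); it escapes each term and places the alternatives inside the negative lookahead.

```python
import re

def build_symbol_regex(excluded):
    """Hypothetical helper: two-letter symbols, minus the excluded terms."""
    if not excluded:
        # No exclusions: plain word-boundary pattern.
        return re.compile(r"\b[A-Z][a-z]\b")
    alternatives = "|".join(re.escape(term) for term in excluded)
    # (?!(?:Cr|Pb)\b) is equivalent to (?!Cr\b|Pb\b)
    return re.compile(rf"\b(?!(?:{alternatives})\b)[A-Z][a-z]\b")

text = "this is Vc and not Cr nor Pb"
print(build_symbol_regex(["Cr", "Pb"]).findall(text))  # ['Vc']
print(build_symbol_regex([]).findall(text))            # ['Vc', 'Cr', 'Pb']
```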
Explanation
\b A word boundary to prevent a partial word match
(?! Negative lookahead
Cr\b|Pb\b Match either Cr or Pb, followed by a word boundary
) Close the lookahead
[A-Z][a-z] Match a single uppercase and a single lowercase char
\b A word boundary
Upvotes: 2