Reputation: 439
I need a regex statement that will check for three capital letters in a row.
For example, it should match: ABC, aABC, abcABC
But it should not match: AaBbBc, ABCDE
At the moment this is my statement:
'[^A-Z]*[A-Z]{3}[^A-Z]*'
But this matches ABCDE. What am I doing wrong?
Upvotes: 3
Views: 7074
Reputation: 5182
In your regular expression, the [^A-Z]* at the beginning and end is saying "Look for any number of non-capital letters, including 0." And so, ABCDE will satisfy your regular expression. For example, the empty string before ABC can be seen as "0 non-capital letters", ABC matches [A-Z]{3}, and the empty string after it is also "0 non-capital letters." The trailing DE is simply left outside the match.
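I think what you want to do instead is craft a regular expression that looks for exactly 3 capital letters with no other capital letter directly before or after them.
It doesn't matter how many non-capital characters precede or follow your 3 capital letters, as long as there is at least 1 or the string starts or ends there. So, you just need to look for 1.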
Try this:
(^|[^A-Z])[A-Z]{3}([^A-Z]|$)
Note that the first ^ means start of string, which is different from the meaning of ^ inside the brackets. The $ means end of string.
Tested in Ruby, here is what we have:
regexp = /(^|[^A-Z])[A-Z]{3}([^A-Z]|$)/
'ABC'.match(regexp) # returns a match
'aABC'.match(regexp) # returns a match
'abcABC'.match(regexp) # returns a match
'AaBbBc'.match(regexp) # returns nil
'ABCDE'.match(regexp) # returns nil
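For anyone following along in Python rather than Ruby, here is a minimal sketch of the same check using the standard re module. The helper name has_isolated_triple and the CASES table are made up for this illustration; the pattern is the one from this answer, and the expected values come from the question.

```python
import re

# Pattern from the answer above: three capitals bounded on each side by
# either a non-capital character or the start/end of the string.
PATTERN = re.compile(r"(^|[^A-Z])[A-Z]{3}([^A-Z]|$)")


def has_isolated_triple(text):
    """Return True if text contains a run of exactly three capital letters.

    Hypothetical helper name, used only for this illustration.
    """
    return PATTERN.search(text) is not None


# Expected results taken from the examples in the question.
CASES = {
    "ABC": True,
    "aABC": True,
    "abcABC": True,
    "AaBbBc": False,
    "ABCDE": False,
}

for text, expected in CASES.items():
    print(text, has_isolated_triple(text), expected)
```

Each printed line should show the same boolean twice if the pattern behaves as described.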
Upvotes: 3
Reputation: 71538
You have to keep in mind that when you're using regexes, they will try as much as they can to get a match (that is also one of the biggest weaknesses of regex, and it is what often causes catastrophic backtracking). What this implies is that in your current regex:
[^A-Z]*[A-Z]{3}[^A-Z]*
[A-Z]{3} is matching 3 uppercase letters, and both [^A-Z]* are matching nothing (or empty strings). You can see how by using capture groups:
import re

theString = "ABCDE"
# Each part of the original regex gets its own capture group.
pattern = re.compile(r"([^A-Z]*)([A-Z]{3})([^A-Z]*)")
result = pattern.search(theString)
if result:
    print("Matched string: {" + result.group(0) + "}")
    print("Sub match 1: {" + result.group(1) + "} 2. {" + result.group(2) + "} 3. {" + result.group(3) + "}")
else:
    print("No match")
Prints:
Matched string: {ABC}
Sub match 1: {} 2. {ABC} 3. {}
Do you see what happened now? Since [^A-Z]* can also accept 'nothing', that's exactly what it'll try to do and match an empty string.
What you probably wanted was to use something more like this:
([^A-Z]|^)[A-Z]{3}([^A-Z]|$)
It will match a string containing three consecutive uppercase letters when there are no more uppercase letters around them (the |^ means OR at the beginning and |$ means OR at the end). If you use that regex in the little script above, you will not get any match in ABCDE, which is what you wanted. If you use it on the string abcABC, you get:
import re

theString = "abcABC"
# Either a non-capital character or a string boundary must sit on each side.
pattern = re.compile(r"([^A-Z]|^)([A-Z]{3})([^A-Z]|$)")
result = pattern.search(theString)
if result:
    print("Matched string: {" + result.group(0) + "}")
    print("Sub match 1: {" + result.group(1) + "} 2. {" + result.group(2) + "} 3. {" + result.group(3) + "}")
Prints:
Matched string: {cABC}
Sub match 1: {c} 2. {ABC} 3. {}
The [^A-Z] is actually matching (or in better regex terms, consuming) a character, and if you only care about checking whether or not the string contains exactly 3 uppercase characters in a row, that regex would suffice.
If you want to extract those uppercase characters, you can use a capture group like in the above example and use result.group(2) to get them.
Actually, if you turn some capture groups into non-capture groups...
(?:[^A-Z]|^)([A-Z]{3})(?:[^A-Z]|$)
You can use result.group(1) to get those 3 letters.
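As a small sketch of that non-capturing variant (same test string as before, variable names chosen only for illustration), the whole match still contains the consumed boundary character, while group 1 holds just the letters:

```python
import re

# Non-capturing groups (?:...) still require the boundary, but they do not
# get a group number, so the three letters become group 1.
pattern = re.compile(r"(?:[^A-Z]|^)([A-Z]{3})(?:[^A-Z]|$)")

result = pattern.search("abcABC")
if result:
    whole_match = result.group(0)  # includes the consumed boundary character
    letters = result.group(1)      # only the three capital letters
    print("Whole match: {" + whole_match + "}")
    print("Letters: {" + letters + "}")
```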
Otherwise, if you don't mind using lookarounds (they can be a little harder to understand), you won't have to use capture groups. Vasili's answer shows exactly how you use them:
(?<![A-Z])[A-Z]{3}(?![A-Z])
(?<! ... ) is a negative lookbehind and will prevent a match if the pattern inside matches the previous character(s). In this case, if the previous character matches [A-Z], the match will fail.
(?! ... ) is a negative lookahead and will prevent a match if the pattern inside matches the next character(s). In this case, if the next character matches [A-Z], the match will fail. With this regex, you can simply use .group() to get those uppercase letters:
import re

theString = "abcABC"
# Lookarounds check the neighbouring characters without consuming them.
pattern = re.compile(r"(?<![A-Z])[A-Z]{3}(?![A-Z])")
result = pattern.search(theString)
if result:
    print("Matched string: {" + result.group() + "}")
I hope it wasn't too long :)
Upvotes: 2
Reputation: 541
Your regex itself explains what you are doing wrong:
'[^A-Z]*[A-Z]{3}[^A-Z]*'
When ^ is used inside a character set, i.e. [], it negates the set, so [^A-Z]* matches zero or more characters that are not capital letters at the start. Because zero is allowed, it does not stop capital letters from appearing there, which, going by your example, is not what you want.
[A-Z]{3} means it will match exactly three capital letters in a row.
The trailing [^A-Z]* works the same way as the first one.
If you write '[A-Z]{3}' only, it will match the first three consecutive capital letters anywhere in the string.
It would match ABCde abCDE aBCDe ABCDE, but it would not match abcDE ABcDE AaBcCc.
Just try it.
Example in Perl
#!/usr/bin/perl
use strict;
use warnings;

my @arr = qw(AaBsCc abCDE ABCDE AbcDE abCDE ABC aABC abcABC);
foreach my $string (@arr) {
    if ($string =~ m/[A-Z]{3}/) {
        print "Matched $string\n";
    }
    else {
        print "Didn't match $string \n";
    }
}
Output:
Didn't match AaBsCc
Matched abCDE
Matched ABCDE
Didn't match AbcDE
Matched abCDE
Matched ABC
Matched aABC
Matched abcABC
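A rough Python equivalent of the Perl loop, for comparison. As the output above shows, the bare [A-Z]{3} reports ABCDE as a match, which the question wanted to exclude, so this sketch also runs the lookaround pattern from the other answers on the same words. The variable names are arbitrary.

```python
import re

words = ["AaBsCc", "abCDE", "ABCDE", "AbcDE", "ABC", "aABC", "abcABC"]

# Any three capitals in a row, regardless of what surrounds them.
bare = re.compile(r"[A-Z]{3}")
# Three capitals with no further capital directly before or after.
bounded = re.compile(r"(?<![A-Z])[A-Z]{3}(?![A-Z])")

# Map each word to (bare matches?, bounded matches?).
results = {w: (bool(bare.search(w)), bool(bounded.search(w))) for w in words}

for word, (bare_hit, bounded_hit) in results.items():
    print(word, bare_hit, bounded_hit)
```

The two patterns should disagree only on ABCDE, where the bare version finds ABC inside the longer run.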
Upvotes: 0
Reputation: 454
You can use this:
'^(?:.*[^A-Z])?[A-Z]{3}(?:[^A-Z].*)?$'
Explanation:
^, $ to match start and end of line.
(?:.*[^A-Z])? to check that the previous character is not capital (if any).
(?:[^A-Z].*)? to check that the next character is not capital (if any).
Upvotes: 0
Reputation: 9591
(?<![A-Z])[A-Z]{3}(?![A-Z])
I specified a negative lookbehind and a negative lookahead before and after the middle regex for three capitals in a row, respectively.
This is a better option compared to using a bare negated character class because it will successfully match even when there are no characters to the left or right of the three capitals.
As for the Python code, I haven't figured out how to print out the actual matches, but this is the syntax:
Using re.match:
>>> import re
>>> p = re.compile(r'(?<![A-Z])[A-Z]{3}(?![A-Z])')
>>> s = '''ABC
... aABC
... abcABCabcABCDabcABCDEDEDEDa
... ABCDE'''
>>> result = p.match(s)
>>> result.group()
'ABC'
Using re.search:
>>> import re
>>> p = re.compile(r'(?<![A-Z])[A-Z]{3}(?![A-Z])')
>>> s = 'ABcABCde'
>>> p.search(s).group()
'ABC'
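To print every match rather than only the first one, findall and finditer from the standard re module are one option. This is a minimal sketch that reuses the multi-line sample string from the re.match example; note that re.match only tries to match at the very beginning of the string, whereas the calls below scan the whole string.

```python
import re

p = re.compile(r"(?<![A-Z])[A-Z]{3}(?![A-Z])")
s = "ABC\naABC\nabcABCabcABCDabcABCDEDEDEDa\nABCDE"

# findall returns the matched text of every non-overlapping match.
all_matches = p.findall(s)
print(all_matches)

# finditer yields match objects, which also expose the match position.
positions = [(m.start(), m.group()) for m in p.finditer(s)]
for start, text in positions:
    print(start, text)
```

On this sample string that should give three ABC matches, starting at offsets 0, 5, and 12; the longer runs ABCD and ABCDE are skipped.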
Upvotes: 5