Hayley van Waas

Reputation: 439

Regex statement to check for three capital letters

I need a regex statement that will check for three capital letters in a row.

For example, it should match: ABC, aABC, abcABC

But it should not match: AaBbBc, ABCDE

At the moment this is my statement:

'[^A-Z]*[A-Z]{3}[^A-Z]*'

But this matches ABCDE. What am I doing wrong?

Upvotes: 3

Views: 7074

Answers (5)

Alvin S. Lee

Reputation: 5182

In your regular expression, the [^A-Z]* at the beginning and end says "look for any number of non-capital letters, including 0." So ABCDE satisfies your regular expression: the match is "0 non-capital letters" followed by ABC followed by another "0 non-capital letters," and the trailing DE is simply left outside the match. (BCD and CDE would qualify in the same way.)

I think what you want to do instead is craft a regular expression that looks for:

  1. Either "a non-capital letter" or the start of my string.
  2. Followed by "exactly 3 capital letters."
  3. Followed by "a non-capital letter" or the end of my string.

It doesn't matter how many non-capital letters precede or follow your 3 capital letters. A single one on each side (or the edge of the string) is enough to show that the run is exactly 3 long, so you just need to look for 1.

Try this:

(^|[^A-Z])[A-Z]{3}([^A-Z]|$)

Note that the first ^ means start of string, which is different from the meaning of ^ inside the brackets. The $ means end of string.

Tested in Ruby (where ^ and $ anchor at line boundaries; \A and \z are the strict string anchors), here is what we have:

regexp = /(^|[^A-Z])[A-Z]{3}([^A-Z]|$)/
'ABC'.match(regexp)    # returns a match
'aABC'.match(regexp)   # returns a match
'abcABC'.match(regexp)  # returns a match
'AaBbBc'.match(regexp) # returns nil
'ABCDE'.match(regexp)  # returns nil
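
For readers working in Python rather than Ruby, the same check can be sketched roughly as follows. This is a port under the assumption that only the five strings from the question matter; the names PATTERN, cases, and results are illustrative and not part of the original answer.

```python
import re

# Same pattern as above: start of string or a non-capital,
# exactly three capitals, then a non-capital or end of string.
PATTERN = re.compile(r"(^|[^A-Z])[A-Z]{3}([^A-Z]|$)")

# Test strings taken from the question.
cases = ["ABC", "aABC", "abcABC", "AaBbBc", "ABCDE"]

# Map each string to whether the pattern is found anywhere in it.
results = {s: PATTERN.search(s) is not None for s in cases}

for s, matched in results.items():
    print(f"{s!r}: {'match' if matched else 'no match'}")
```

The first three strings should report a match and the last two should not, mirroring the Ruby results.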

Upvotes: 3

Jerry

Reputation: 71538

You have to keep in mind that regex engines will try as hard as they can to find a match (this is also one of the biggest weaknesses of regex, and it is what often causes catastrophic backtracking). What this implies is that in your current regex:

[^A-Z]*[A-Z]{3}[^A-Z]*

[A-Z]{3} is matching 3 uppercase letters, and both [^A-Z]* are matching nothing (or empty strings). You can see how by using capture groups:

import re
theString = "ABCDE"
pattern = re.compile(r"([^A-Z]*)([A-Z]{3})([^A-Z]*)")
result = pattern.search(theString)

if result:
    print("Matched string: {" + result.group(0) + "}")
    print("Sub match 1: {" + result.group(1) + "} 2. {" + result.group(2) + "} 3. {" + result.group(3) + "}")
else:
    print("No match")

Prints:

Matched string: {ABC}
Sub match 1: {} 2. {ABC} 3. {}


Do you see what happened now? Since [^A-Z]* can also accept 'nothing', that's exactly what it'll try to do and match an empty string.

What you probably wanted was to use something more like this:

([^A-Z]|^)[A-Z]{3}([^A-Z]|$)

It will match a string containing three consecutive uppercase letters when there are no other uppercase letters directly around them (the |^ means "or at the beginning" and |$ means "or at the end"). If you use that regex in the little script above, you will not get any match in ABCDE, which is what you wanted. If you use it on the string abcABC, you get:

import re
theString = "abcABC"
pattern = re.compile(r"([^A-Z]|^)([A-Z]{3})([^A-Z]|$)")
result = pattern.search(theString)

if result:
    print("Matched string: {" + result.group(0) + "}")
    print("Sub match 1: {" + result.group(1) + "} 2. {" + result.group(2) + "} 3. {" + result.group(3) + "}")

Prints:

Matched string: {cABC}
Sub match 1: {c} 2. {ABC} 3. {}

The [^A-Z] is actually matching (or, in better regex terms, consuming) a character. If you only care about checking whether or not the string contains exactly 3 uppercase characters in a row, that regex will suffice.


If you want to extract those uppercase characters, you can use a capture group like in the above example and use result.group(2) to get it.

Actually, if you turn some capture groups into non-capture groups...

(?:[^A-Z]|^)([A-Z]{3})(?:[^A-Z]|$)

You can use result.group(1) to get those 3 letters.
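
As a minimal sketch of that non-capturing variant (the variable name letters is an illustrative choice, and the input is the same abcABC string used earlier):

```python
import re

# (?: ... ) groups without capturing, so the three capitals
# become group 1 instead of group 2.
pattern = re.compile(r"(?:[^A-Z]|^)([A-Z]{3})(?:[^A-Z]|$)")
result = pattern.search("abcABC")

# group(0) still includes the consumed neighbour ("cABC");
# group(1) holds only the capitals.
letters = result.group(1) if result else None
print(letters)  # ABC
```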

Otherwise, if you don't mind using lookarounds (they can be a little harder to understand), you won't have to use capture groups. Vasili's answer shows exactly how you use them:

(?<![A-Z])[A-Z]{3}(?![A-Z])

(?<! ... ) is a negative lookbehind and will prevent a match if the pattern inside matches the previous character(s). In this case, if the previous character matches [A-Z] the match will fail.

(?! ... ) is a negative lookahead and will prevent a match if the pattern inside matches the next character(s). Here, if the next character matches [A-Z], the match will fail. Because lookarounds do not consume anything, you can simply use .group() to get those uppercase letters:

import re
theString = "abcABC"
pattern = re.compile(r"(?<![A-Z])[A-Z]{3}(?![A-Z])")
result = pattern.search(theString)

if result:
    print("Matched string: {" + result.group() + "}")
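
One practical consequence of consuming versus looking around shows up when two runs are separated by a single character. The sketch below is only an illustration: the input "ABC DEF" and the variable names are assumptions made for the example, not something from the question.

```python
import re

# Consuming version: the neighbouring non-capital becomes part of the match.
consuming = re.compile(r"(?:^|[^A-Z])([A-Z]{3})(?:[^A-Z]|$)")
# Lookaround version: neighbours are inspected but not consumed.
lookaround = re.compile(r"(?<![A-Z])[A-Z]{3}(?![A-Z])")

text = "ABC DEF"  # hypothetical input with two runs split by one space

# The space is consumed by the first match, so it is no longer
# available to serve as the non-capital in front of DEF.
consumed_hits = consuming.findall(text)
lookaround_hits = lookaround.findall(text)

print(consumed_hits)    # ['ABC']
print(lookaround_hits)  # ['ABC', 'DEF']
```

If you only need a yes/no answer for the whole string, either version works; the difference matters when you want every occurrence.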


I hope it wasn't too long :)

Upvotes: 2

Jassi

Reputation: 541

Your regex itself explains what you are doing wrong:

'[^A-Z]*[A-Z]{3}[^A-Z]*'

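When ^ is used inside a character set, i.e. [], it negates the set. So [^A-Z]* at the start matches zero or more characters that are not capital letters. Because zero is allowed, it does nothing to stop extra capital letters from sitting next to the three you match, and going by your examples, I think you don't want that.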

[A-Z]{3} means it will exactly match three capital letters in a row.

[^A-Z]* at the end means the same as what I explained for the first one.

If you write '[A-Z]{3}' only, it matches the first three consecutive capital letters anywhere in the string, including inside a longer run such as ABCDE.

It would match ABCde, abCDE, aBCDe and ABCDE, but it would not match abcDE, ABcDE or AaBcCc.

Just try it.

Example in Perl

#!/usr/bin/perl
use strict;
use warnings;

my @arr = qw(AaBsCc abCDE ABCDE AbcDE abCDE ABC aABC abcABC);

foreach my $string(@arr){
  if($string =~ m/[A-Z]{3}/){
    print "Matched $string\n";
  }
  else {
    print "Didn't match $string \n";
  }
}

Output:

Didn't match AaBsCc
Matched abCDE
Matched ABCDE
Didn't match AbcDE
Matched abCDE
Matched ABC
Matched aABC
Matched abcABC
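
For comparison, here is a rough Python sketch that runs the bare [A-Z]{3} pattern next to the lookaround pattern from another answer in this thread, using the strings from the question. The names bare, bounded, and words are illustrative.

```python
import re

bare = re.compile(r"[A-Z]{3}")                        # any three capitals in a row
bounded = re.compile(r"(?<![A-Z])[A-Z]{3}(?![A-Z])")  # exactly three, no capital neighbours

# Strings from the question: the first three should match, the last two should not.
words = ["ABC", "aABC", "abcABC", "AaBbBc", "ABCDE"]

bare_hits = [w for w in words if bare.search(w)]
bounded_hits = [w for w in words if bounded.search(w)]

print(bare_hits)     # ['ABC', 'aABC', 'abcABC', 'ABCDE']
print(bounded_hits)  # ['ABC', 'aABC', 'abcABC']
```

The bare pattern accepts ABCDE, which the question wants rejected, so on its own it does not meet the stated requirement.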

Upvotes: 0

MIE

Reputation: 454

You can use this:

    '^(?:.*[^A-Z])?[A-Z]{3}(?:[^A-Z].*)?$'

Explanation:

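  • ^,$ to match start and end of line.
  • (?:.*[^A-Z])? to check that the previous character is not capital (if any).
  • [A-Z]{3} to match exactly three capital letters.
  • (?:[^A-Z].*)? to check that the next character is not capital (if any).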
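
A quick way to try this whole-string pattern in Python might look like the sketch below. The helper name has_exact_triple is hypothetical, and the inputs are the examples from the question.

```python
import re

# Anchored pattern: optional prefix ending in a non-capital,
# three capitals, optional suffix starting with a non-capital.
WHOLE = re.compile(r"^(?:.*[^A-Z])?[A-Z]{3}(?:[^A-Z].*)?$")


def has_exact_triple(s: str) -> bool:
    """Return True if s contains a run of exactly three capitals."""
    return WHOLE.match(s) is not None


for s in ["ABC", "aABC", "abcABC", "AaBbBc", "ABCDE"]:
    print(s, has_exact_triple(s))
```

Because the pattern spans the whole line, the match object covers the entire string, so this form suits a yes/no test better than extracting the three letters.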

Upvotes: 0

Vasili Syrakis

Reputation: 9591

Regex

(?<![A-Z])[A-Z]{3}(?![A-Z])

Explanation

I specified a negative lookbehind before, and a negative lookahead after, the middle regex for three capitals in a row.

This is a better option compared to using a negated character class on its own, because a lookaround succeeds even when there are no characters at all to the left or right of the three capitals, whereas a negated character class always needs a character to match.



As for the Python code, I haven't figured out how to print out the actual matches, but this is the syntax:

Using re.match:

>>> import re
>>> p = re.compile(r'(?<![A-Z])[A-Z]{3}(?![A-Z])')
>>> s = '''ABC
... aABC
... abcABCabcABCDabcABCDEDEDEDa
... ABCDE'''
>>> result = p.match(s)
>>> result.group()
'ABC'

Using re.search:

>>> import re
>>> p = re.compile(r'(?<![A-Z])[A-Z]{3}(?![A-Z])')
>>> s = 'ABcABCde'
>>> p.search(s).group()
'ABC'
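
To list every match rather than only the first, one option is re.findall or re.finditer. This is a sketch reusing the multi-line sample string from the re.match example; the variable names matches and positions are illustrative.

```python
import re

p = re.compile(r"(?<![A-Z])[A-Z]{3}(?![A-Z])")

# Same sample text as in the re.match example, written with explicit newlines.
s = "ABC\naABC\nabcABCabcABCDabcABCDEDEDEDa\nABCDE"

# findall returns the matched substrings (the pattern has no capture groups).
matches = p.findall(s)
print(matches)  # ['ABC', 'ABC', 'ABC']

# finditer yields match objects, which also expose where each match starts.
positions = [(m.start(), m.group()) for m in p.finditer(s)]
for start, text in positions:
    print(start, text)
```

The longer runs ABCD, ABCDEDEDED and ABCDE are skipped because each has a capital letter directly before or after any three-letter window.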

Upvotes: 5
