dav

Reputation: 9267

PHP regex - find uppercase string with number and spaces in text

I want to write a PHP regular expression that finds an uppercase string in a text. The string may also contain one number and spaces.

For example, from the text "some text to contain EXAM PL E 7STRING uppercase word" I want to get the string EXAM PL E 7STRING.

The found string should start and end with an uppercase letter. In between, besides uppercase letters, it may (but does not have to) contain one number and spaces. So the regex should match any of these patterns:

1) EXAMPLESTRING               - just uppercase string
2) EXAMP4LESTRING              - with number
3) EXAMPLES TRING              - with space
4) EXAM PL E STRING            - with more than one space
5) EXAMP LE4STRING             - with number and space
6) EXAMP LE 4ST RI NG          - with number and spaces 

The string should also contain at least 4 letters in total.

I wrote the regex '/[A-Z]{1,}([A-Z\s]{2,}|\d?)[A-Z]{1,}/', which finds the first 4 patterns, but I cannot figure out how to also match the last 2 patterns.
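To show where the attempt stops short, here is a minimal sketch that runs it against samples 2, 5 and 6. The variable names `$attempt`, `$samples` and `$found` are made up for this illustration. Because the group allows either letters and spaces or a single digit, but never a mix of both, the match is cut off at the digit as soon as a space has been consumed.

```php
<?php
// The pattern from the question, unchanged.
$attempt = '/[A-Z]{1,}([A-Z\s]{2,}|\d?)[A-Z]{1,}/';

// Samples 2, 5 and 6 from the list above.
$samples = ['EXAMP4LESTRING', 'EXAMP LE4STRING', 'EXAMP LE 4ST RI NG'];

$found = [];
foreach ($samples as $sample) {
    preg_match($attempt, $sample, $m);
    $found[$sample] = $m[0] ?? null;   // first match, or null if none
    printf("%-20s => %s\n", $sample, var_export($found[$sample], true));
}
// EXAMP4LESTRING       => 'EXAMP4LESTRING'
// EXAMP LE4STRING      => 'EXAMP LE'
// EXAMP LE 4ST RI NG   => 'EXAMP LE'
```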

Thanks

Upvotes: 2

Views: 1649

Answers (2)

Ωmega

Reputation: 43673

I suggest using this regex pattern:

[A-Z][ ]*(\d)?(?(1)(?:[ ]*[A-Z]){3,}|[A-Z][ ]*(\d)?(?(2)(?:[ ]*[A-Z]){2,}|[A-Z][ ]*(\d)?(?(3)(?:[ ]*[A-Z]){2,}|[A-Z][ ]*(?:\d|(?:[ ]*[A-Z])+[ ]*\d?))))(?:[ ]*[A-Z])*

(see this demo).

[A-Z][ ]*(?:\d(?:[ ]*[A-Z]){2}|[A-Z][ ]*\d[ ]*[A-Z]|(?:[A-Z][ ]*){2,}\d?)[A-Z ]*[A-Z]

(see this demo)
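For readers without access to the demo, a minimal sketch of how the shorter pattern could be used from PHP. The `/` delimiters and the variable names are assumptions added for this example; the answer gives only the bare pattern.

```php
<?php
// Shorter pattern from above, wrapped in "/" delimiters (an assumption of this sketch).
$pattern = '/[A-Z][ ]*(?:\d(?:[ ]*[A-Z]){2}|[A-Z][ ]*\d[ ]*[A-Z]|(?:[A-Z][ ]*){2,}\d?)[A-Z ]*[A-Z]/';

// Example text from the question.
$text = 'some text to contain EXAM PL E 7STRING uppercase word';

$match = preg_match($pattern, $text, $m) ? $m[0] : null;
var_dump($match);   // string(17) "EXAM PL E 7STRING"
```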

Upvotes: 2

Martin Ender

Reputation: 44259

There is a neat trick called a lookahead. It just checks what follows the current position, without consuming any characters. That can be used to check several conditions at once:

'/(?<![A-Z])(?=(?:[A-Z][\s\d]*){3}[A-Z])(?!(?:[A-Z\s]*\d){2})[A-Z][A-Z\s\d]*[A-Z]/'

The first lookaround is actually a lookbehind and checks that there is no uppercase letter directly before the match. This is just a little speedup for strings that would fail the match anyway. The second lookaround (a lookahead) checks that there are at least four letters. The third one checks that there are no two digits. The rest then simply matches a string of the allowed characters, starting and ending with an uppercase letter.
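A minimal sketch of the three checks in action. The inputs `AB C` and `AB1C2D` are made-up examples (not from the question), chosen so that one fails the letter count and the other fails the digit check.

```php
<?php
// Lookaround-based pattern from above.
$pattern = '/(?<![A-Z])(?=(?:[A-Z][\s\d]*){3}[A-Z])(?!(?:[A-Z\s]*\d){2})[A-Z][A-Z\s\d]*[A-Z]/';

// Text from the question: four or more letters, one digit.
$fromQuestion = preg_match($pattern, 'some text to contain EXAM PL E 7STRING uppercase word', $m) ? $m[0] : null;

// Only three letters: the "at least four letters" lookahead rejects it.
$threeLetters = preg_match($pattern, 'AB C', $m) ? $m[0] : null;

// Two digits: the negative lookahead rejects the candidate starting at "A",
// and too few letters remain after that for any later start.
$twoDigits = preg_match($pattern, 'AB1C2D', $m) ? $m[0] : null;

var_dump($fromQuestion);   // string(17) "EXAM PL E 7STRING"
var_dump($threeLetters);   // NULL
var_dump($twoDigits);      // NULL
```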

Note that if the string contains two digits, this will not match it at all (instead of matching everything up to the second digit). If you do want a match in such a case, you could incorporate the "one digit" rule into the actual match instead:

'/(?<![A-Z])(?=(?:[A-Z][\s\d]*){3}[A-Z])[A-Z][A-Z\s]*\d?[A-Z\s]*[A-Z]/'
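A minimal sketch of how the two variants differ, using the made-up input `ABCD1E2F` (two digits, with four letters before the first one). The variable names are only for this illustration.

```php
<?php
// Variant 1: a negative lookahead forbids a second digit.
$strict = '/(?<![A-Z])(?=(?:[A-Z][\s\d]*){3}[A-Z])(?!(?:[A-Z\s]*\d){2})[A-Z][A-Z\s\d]*[A-Z]/';
// Variant 2: the match itself allows at most one digit.
$lenient = '/(?<![A-Z])(?=(?:[A-Z][\s\d]*){3}[A-Z])[A-Z][A-Z\s]*\d?[A-Z\s]*[A-Z]/';

$input = 'ABCD1E2F';

$strictMatch  = preg_match($strict, $input, $m) ? $m[0] : null;
$lenientMatch = preg_match($lenient, $input, $m) ? $m[0] : null;

var_dump($strictMatch);    // NULL
var_dump($lenientMatch);   // string(6) "ABCD1E", i.e. everything up to the second digit
```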

EDIT:

As Ωmega pointed out, this will cause problems if there are fewer than four letters before the second digit, but more after it. This is actually quite tough, because the assertion needs to be that there are at least four letters before the second digit. Since we do not know where the first digit occurs among those four letters, we have to check all possible positions. For this I would do away with the lookaheads altogether and simply provide the three different alternatives. (I will keep the lookbehind as an optimization for non-matching parts.)

'/(?<![A-Z])[A-Z]\s*(?:\d\s*[A-Z]\s*[A-Z]|[A-Z]\s*\d\s*[A-Z]|[A-Z]\s*[A-Z][A-Z\s]*\d?)[A-Z\s]*[A-Z]/'

Or here with added comments:

'/
(?<!         # negative lookbehind
    [A-Z]    # current position is not preceded by a letter
)            # end of lookbehind
[A-Z]        # match has to start with uppercase letter
\s*          # optional spaces after first letter
(?:          # subpattern for possible digit positions
    \d\s*[A-Z]\s*[A-Z]
             # digit comes after first letter, we need two more letters before last one
|            # OR
    [A-Z]\s*\d\s*[A-Z]
             # digit comes after second letter, we need one more letter before last one
|            # OR
    [A-Z]\s*[A-Z][A-Z\s]*\d?
             # digit comes after third letter, or later, or not at all
)            # end of subpattern for possible digit positions
[A-Z\s]*     # arbitrary amount of further letters and whitespace
[A-Z]        # match has to end with uppercase letter
/x'

That gives the same result on Ωmega's lengthy test input.
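A minimal sketch comparing the lookahead variant from before the edit with the final pattern. The input `AB1C2D` is a made-up example of the problem case (three letters before the second digit, a fourth one after it), and the six samples are the ones listed in the question. All variable names are assumptions of this sketch.

```php
<?php
// Lookahead variant from before the edit.
$lookaheadVersion = '/(?<![A-Z])(?=(?:[A-Z][\s\d]*){3}[A-Z])[A-Z][A-Z\s]*\d?[A-Z\s]*[A-Z]/';
// Final pattern with the three explicit alternatives.
$alternatives = '/(?<![A-Z])[A-Z]\s*(?:\d\s*[A-Z]\s*[A-Z]|[A-Z]\s*\d\s*[A-Z]|[A-Z]\s*[A-Z][A-Z\s]*\d?)[A-Z\s]*[A-Z]/';

// The lookahead counts the letter after the second digit, so the old
// pattern accepts a match that contains only three letters.
$tricky = 'AB1C2D';
$old = preg_match($lookaheadVersion, $tricky, $m) ? $m[0] : null;
$new = preg_match($alternatives, $tricky, $m) ? $m[0] : null;
var_dump($old);   // string(4) "AB1C"
var_dump($new);   // NULL

// The six samples from the question should each be matched in full.
$samples = [
    'EXAMPLESTRING', 'EXAMP4LESTRING', 'EXAMPLES TRING',
    'EXAM PL E STRING', 'EXAMP LE4STRING', 'EXAMP LE 4ST RI NG',
];
$fullyMatched = [];
foreach ($samples as $s) {
    $fullyMatched[$s] = preg_match($alternatives, $s, $m) === 1 && $m[0] === $s;
}
var_dump(!in_array(false, $fullyMatched, true));   // bool(true)
```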

Upvotes: 5
