GradviusMars

Reputation: 91

Regex to match 4 letters in a string

I am trying to write a regex that will match a string containing 4 or more letters, not necessarily consecutive.

The input string can have a mix of upper and lowercase letters, numbers, non-alpha chars etc, but I only want it to pass the regex test if it contains at least 4 upper or lowercase letters.

An example of what I would like to be a valid input can be seen below:

a124Gh0st

I have currently written this piece of regex:

    (?(?=[a-zA-Z])([a-zA-Z])| )

This returns 5 matches successfully, but it currently passes as long as there is at least one letter in the input string. If I add {4,} to the end of it then it works, but only when there are 4 letters in a row.

I am using the following website to test what I have been doing: regex101

Any help on this would be greatly appreciated.

Upvotes: 5

Views: 19852

Answers (2)

John Bollinger

Reputation: 181159

Why don't you just match the zero or more characters between each letter? For example,

    (?:[A-Za-z].*){4}

You'll recognize the [A-Za-z]. The . matches any character, so .* is a run of any number (including zero) of any character. The group of a letter followed by any number of any characters is repeated four times, so this pattern matches if and only if at least four letters appear in the string. (Note that the trailing .* of the fourth repeat of the pattern is mostly inconsequential, since it can match zero characters).
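As a quick check of the pattern above, here is a minimal sketch in Python's `re` module. The language choice and the helper name `has_four_letters` are assumptions for illustration only; the question does not say which engine is being used.

```python
import re

# The pattern from the answer: a letter followed by any run of characters,
# repeated four times.
FOUR_LETTERS = re.compile(r"(?:[A-Za-z].*){4}")


def has_four_letters(text: str) -> bool:
    """Return True if the pattern matches somewhere in text (hypothetical helper)."""
    # re.search scans the whole string, so no anchors are needed.
    return FOUR_LETTERS.search(text) is not None


for sample in ["a124Gh0st", "a1b2c3", "1234", "abcd"]:
    print(sample, has_four_letters(sample))
```

Note that in Python `.` does not match a newline unless `re.DOTALL` is set, so letters separated by line breaks would not be counted together by this form.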

If you are using a regex language that supports reluctant quantifiers, then using them will make this pattern considerably more efficient. For example, in Java or Perl, one might prefer to use

    (?:[A-Za-z].*?){4}

The .*? still matches any number of any character, but the matching algorithm will match as few characters as possible with each such run. This will reduce the amount of backtracking it needs to perform. For this particular pattern, a string that does contain four letters is matched without the engine ever giving back characters it has already consumed: each run simply extends until it reaches the next letter.
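The difference between the greedy and reluctant forms can be seen in what each one ends up matching. The following is a small sketch in Python (an assumption; any engine with reluctant quantifiers should behave similarly), and the constant names are made up for the example.

```python
import re

GREEDY = re.compile(r"(?:[A-Za-z].*){4}")
LAZY = re.compile(r"(?:[A-Za-z].*?){4}")

sample = "a124Gh0st"  # contains five letters: a, G, h, s, t

greedy_match = GREEDY.search(sample)
lazy_match = LAZY.search(sample)

# The greedy form swallows the rest of the string and backs off only as far
# as it must; the reluctant form stops as soon as the fourth letter is found.
print(greedy_match.group())  # a124Gh0st
print(lazy_match.group())    # a124Gh0s

# Both forms agree on whether a string passes the test.
print(GREEDY.search("a1b2c3") is None, LAZY.search("a1b2c3") is None)  # True True
```

Either form answers the yes/no question identically; they differ only in how much of the input ends up inside the match and in how much work the engine does to get there.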

If you do not have reluctant quantifiers in your regex dialect, then you can achieve the same desirable effect a bit more verbosely:

    (?:[A-Za-z][^A-Za-z]*){4}

There, only non-letters are matched for the runs between letters.

Even with this, the pattern uses some regex features not present in all regex flavors -- non-capturing groups, enumerated quantifiers -- but these are present in your original regex. For a maximally-compatible form, you might write

    [A-Za-z][^A-Za-z]*[A-Za-z][^A-Za-z]*[A-Za-z][^A-Za-z]*[A-Za-z]
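One practical difference between the negated-class form and the `.`-based form shows up with multi-line input. The sketch below uses Python as an assumed host language, and `PORTABLE`, `DOT_BASED`, and `passes` are hypothetical names chosen for the example.

```python
import re

# The fully spelled-out form: letters separated by runs of non-letters.
PORTABLE = re.compile(
    r"[A-Za-z][^A-Za-z]*[A-Za-z][^A-Za-z]*[A-Za-z][^A-Za-z]*[A-Za-z]"
)
# The dot-based form, for comparison.
DOT_BASED = re.compile(r"(?:[A-Za-z].*){4}")


def passes(pattern: re.Pattern, text: str) -> bool:
    """Return True if pattern matches anywhere in text."""
    return pattern.search(text) is not None


for sample in ["a124Gh0st", "a1b2c3", "ab\ncd"]:
    print(repr(sample), passes(PORTABLE, sample), passes(DOT_BASED, sample))
```

In Python, a negated character class such as `[^A-Za-z]` matches a newline, while `.` does not by default. So for `"ab\ncd"` the spelled-out form finds four letters and the dot-based form does not, unless a DOTALL-style flag is enabled.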

Upvotes: 1

Wiktor Stribiżew

Reputation: 627291

You may use

    (?s)^([^a-zA-Z]*[A-Za-z]){4}.*

or

    ^([^a-zA-Z]*[A-Za-z]){4}[\s\S]*

See the regex demo.

Details:

  • ^ - start of string
  • ([^a-zA-Z]*[A-Za-z]){4} - exactly 4 sequences of:
    • [^a-zA-Z]* - 0+ chars other than ASCII letters
    • [A-Za-z] - an ASCII letter
  • .* / [\s\S]* - any 0+ chars ([\s\S]* is the same as .* when the DOTALL modifier is enabled).
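The breakdown above can be tried out with a short sketch. Python is an assumption here (the answer does not name an engine), and `ANCHORED`, `ANCHORED_DOTALL`, and `is_valid` are illustrative names only.

```python
import re

# The two equivalent spellings from the answer.
ANCHORED = re.compile(r"^([^a-zA-Z]*[A-Za-z]){4}[\s\S]*")
ANCHORED_DOTALL = re.compile(r"(?s)^([^a-zA-Z]*[A-Za-z]){4}.*")


def is_valid(text: str) -> bool:
    """Return True if text contains at least four ASCII letters."""
    return ANCHORED.match(text) is not None


m = ANCHORED.match("a124Gh0st")
print(m.group(0))  # a124Gh0st  (the trailing [\s\S]* consumes the rest)
print(m.group(1))  # 0s         (a repeated group keeps only its last iteration)

for sample in ["a124Gh0st", "a1b2c3", "12\nab\ncd"]:
    # Both spellings should agree, including on multi-line input.
    print(repr(sample), is_valid(sample), ANCHORED_DOTALL.match(sample) is not None)
```

Because the pattern is anchored at the start and ends with a match-everything tail, a successful match always covers the entire input, which is convenient for engines or validators that require the whole string to match.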

Upvotes: 7
