Tobias Kienzler

Reputation: 27403

What's a RegEx for "up to three words but no more than 20 characters"?

I can use \s?(\w+\s){0,2}\w* for "up to three words" and \w{0,20} for "no more than twenty characters", but how can I combine these? Merging the two via a lookahead, as mentioned here, seems to fail.

Some examples for clarification:

The early bird catches the worm.

should match any three words in sequence (including the worm*).

Here we have a supercalifragilisticexpialidocious sentence.

"a supercalifragilisticexpialidocious sentence" is too long a sequence and therefore should not match.


* In my actual use case I'm going for a paragraph's last three words, i.e. a (?:\r) would be at the end of the RegEx and the match would be "catches the worm.". Matches are then given a "no linebreaks" character style in Adobe InDesign in order to avoid orphans.

Upvotes: 1

Views: 3417

Answers (1)

Wiktor Stribiżew

Reputation: 626748

To match up to three whitespace-separated words at the end of a line or string, you can use

\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])

See the regex demo. Note that in the demo I use [^\S\r\n] instead of \s in the lookahead because the text contains newlines; use the same trick if you need it.
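To illustrate why the lookahead matters on multi-line text, here is a minimal Python sketch. It assumes Python's built-in re module (the question targets InDesign, whose flavor may differ), and the sample text and the names ORIGINAL and PER_LINE are made up for this example.

```python
import re

# Pattern from the answer: \s in the lookahead can run past a line break.
ORIGINAL = re.compile(r"\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])")

# Same pattern, but the lookahead uses [^\S\r\n] (whitespace except CR/LF),
# so the character count stops at the end of the current line.
PER_LINE = re.compile(r"\b(?!(?:[^\S\r\n]*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])")

# Hypothetical two-line sample text, not taken from the demo.
text = "one two three\nsupercalifragilisticexpialidocious word"

original_matches = ORIGINAL.findall(text)
per_line_matches = PER_LINE.findall(text)
print(original_matches)
print(per_line_matches)
```

With \s, the count for the first line spills into the long word on the second line, so only "word" is found. With [^\S\r\n], "one two three" is found as well.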

Regex explanation

  • \b - a word boundary
  • (?!(?:\s*\w){21}) - a negative lookahead that fails the match if, after the initial word boundary, there are 21 word characters, each optionally preceded by any number of whitespace characters
  • \w+ - 1 word (consisting of 1 or more word characters)
  • (?:\s+\w+){0,2} - zero, one or two sequences of 1+ whitespace characters followed by 1+ word characters
  • (?=$|[\r\n]) - a positive lookahead that only allows a match if it is followed by the end of the string ($) or a line break character ([\r\n]).
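The pieces above can be tried out with a short Python sketch. This assumes Python's built-in re module rather than InDesign's GREP engine, and the helper name last_words is hypothetical. The sample sentences are the ones from the question, with the trailing period left off.

```python
import re

# The pattern from the answer, unchanged.
PATTERN = re.compile(r"\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])")

def last_words(text):
    """Return the first match of PATTERN in text, or None if there is none."""
    m = PATTERN.search(text)
    return m.group(0) if m else None

short_result = last_words("The early bird catches the worm")
long_result = last_words("Here we have a supercalifragilisticexpialidocious sentence")
single_long = last_words("supercalifragilisticexpialidocious")
with_period = last_words("The early bird catches the worm.")
print(short_result, long_result, single_long, with_period)
```

In this sketch the first sentence yields "catches the worm". The second yields only "sentence", because any start position that would include the 34-letter word is rejected by the lookahead. A lone word longer than 20 characters gives no match. Note also that, as written, the pattern needs a word character directly before the line end, so the sentence with a trailing period gives no match here.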

Now, if your words should only contain letters, replace \w with [a-zA-Z] or the equivalent for your language. If the regex flavor allows, use the \p{L} Unicode category class.
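As a rough sketch of the letters-only variant, the following Python snippet swaps every \w for [a-zA-Z]. Python's built-in re module does not support \p{L}, so the ASCII class is used here. The names WORD_CHARS and LETTERS_ONLY and the sample text are assumptions made for illustration.

```python
import re

# Pattern from the answer, where digits and underscores count as word characters.
WORD_CHARS = re.compile(r"\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])")

# Letters-only variant: every \w replaced with [a-zA-Z] (ASCII letters only).
LETTERS_ONLY = re.compile(
    r"\b(?!(?:\s*[a-zA-Z]){21})[a-zA-Z]+(?:\s+[a-zA-Z]+){0,2}(?=$|[\r\n])"
)

# Hypothetical sample text containing a number.
text = "call 911 now please"

word_match = WORD_CHARS.search(text).group(0)
letters_match = LETTERS_ONLY.search(text).group(0)
print(word_match)
print(letters_match)
```

Here the \w version treats "911" as a word and matches "911 now please", while the letters-only version cannot step across the number and matches "now please".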

Upvotes: 1
