Reputation: 27403
I can use \s?(\w+\s){0,2}\w* for "up to three words" and \w{0,20} for "no more than twenty characters", but how can I combine these? Trying to merge the two via a lookahead as mentioned here seems to fail.
Some examples for clarification:
The early bird catches the worm.
should match any three words in sequence (including the worm*).
Here we have a supercalifragilisticexpialidocious sentence.
"a supercalifragilisticexpialidocious sentence" is too long a sequence and therefore should not match.
* In my actual use case I'm going for a paragraph's last three words (i.e. a (?:\r) would be at the end of the RegEx, and the match would be "catches the worm."). The matches are then given a "no linebreaks" character style in Adobe InDesign in order to avoid orphans.
Upvotes: 1
Views: 3417
Reputation: 626748
To match up to 3 words separated by whitespace at the end of a line or string, with no more than 20 word characters in total, you can use
\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])
See the regex demo. Note that in the demo, I use [^\S\r\n] instead of \s in the lookahead since the text contains newlines. Use the same trick if you need that.
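As a quick check outside InDesign, here is a minimal Python sketch that runs the pattern above and the [^\S\r\n] variant against the question's sample sentences. The helper name last_words and the variable names are made up for this illustration, and the final period of each sentence is dropped because the pattern as written expects a word character directly before the line end.

```python
import re

# Pattern from the answer: up to three trailing words, at most 20 word characters.
PATTERN = re.compile(r"\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])")

# Variant for multi-line text: the lookahead counts only within the current line.
PATTERN_MULTILINE = re.compile(
    r"\b(?!(?:[^\S\r\n]*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])"
)


def last_words(text):
    """Return the first match as a string, or None (hypothetical helper)."""
    m = PATTERN.search(text)
    return m.group(0) if m else None


# Sample sentences from the question, without the trailing period.
short_line = "The early bird catches the worm"
long_line = "Here we have a supercalifragilisticexpialidocious sentence"

print(last_words(short_line))
print(last_words(long_line))
print(PATTERN_MULTILINE.findall(short_line + "\n" + long_line))
```

For the first sentence this yields "catches the worm". For the second, the three-word sequence is rejected and the match shrinks to the single word "sentence". With plain \s in the lookahead, the character count would run across the line break into the next line, which is why the variant is used for the two-line input.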
Regex explanation
\b - a word boundary
(?!(?:\s*\w){21}) - a negative lookahead that fails the match if, after the initial word boundary, there are 21 word characters, each optionally preceded by any number of whitespace symbols
\w+ - 1 word (consisting of 1 or more word characters)
(?:\s+\w+){0,2} - zero, one or two sequences of 1+ whitespaces followed by 1+ word characters
(?=$|[\r\n]) - a positive lookahead that only allows a match to be returned if it is followed by the end of the string ($) or the end of a line ([\r\n]).
Now, if your words should only contain letters, use [a-zA-Z] or the equivalent for your language instead of \w. If the regex flavor allows, use the \p{L} Unicode category/property class.
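The breakdown above can be sketched as a small Python builder that assembles the same pieces, so the word-character class and the limits can be swapped out. The function build_pattern and its parameter names are hypothetical and only serve this illustration. Python's built-in re module does not support \p{L}, so the letters-only call below uses [a-zA-Z].

```python
import re


def build_pattern(word_char=r"\w", max_chars=20, max_words=3):
    """Assemble the pattern piece by piece (hypothetical helper)."""
    parts = [
        r"\b",                                          # word boundary
        rf"(?!(?:\s*{word_char}){{{max_chars + 1}}})",  # fail if more than max_chars word characters follow
        rf"{word_char}+",                               # first word
        rf"(?:\s+{word_char}+){{0,{max_words - 1}}}",   # up to max_words - 1 further words
        r"(?=$|[\r\n])",                                # end of string or a line break comes next
    ]
    return "".join(parts)


# The defaults reproduce the pattern from the answer.
print(build_pattern())

# Letters-only variant: digits no longer count as part of a word.
letters_only = build_pattern(word_char="[a-zA-Z]")
print(re.search(build_pattern(), "tax rate 2024"))
print(re.search(letters_only, "tax rate 2024"))
```

With the default \w the whole of "tax rate 2024" matches, while the letters-only variant finds nothing there, because the line ends in a digit run that [a-zA-Z] cannot consume.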
Upvotes: 1