Regex to find sentences of a minimum length

Question

I am trying to create a regular expression that finds sentences with a minimum length.

Really my conditions are:

There must be at least 5 words in the sequence.
The words in the sequence must be distinct.
The sequence must be followed by a punctuation character.

So far I have tried

^(\b\w*\b\s?){5,}\s?[.?!]$

If my sample text is:

This is a sentence I would like to parse.

This is too short. 

Single word

Not not not distinct distinct words words.

Another sentence that I would be interested in.

I would like to match on strings 1 and 5.

I am using the Python re library. I am testing on regex101, and the regex above appears to do quite a bit of backtracking, so I imagine those knowledgeable in regex may be a bit appalled (my apologies).

Cary Swoveland · Accepted Answer

You can use the following regex to identify the strings that meet all three conditions:

^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$

with the case-insensitive flag (re.IGNORECASE in Python) set.
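As a minimal sketch of how this could be applied with Python's re module, the snippet below tests each sample string from the question on its own. The names PATTERN, samples and matches are illustrative only, and checking one string at a time (rather than scanning a multi-line block with re.MULTILINE) is an assumption about how the input arrives.

```python
import re

# The pattern from the answer, compiled with the case-insensitive flag
# so that "Not" and "not" count as the same word.
PATTERN = re.compile(
    r"^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$",
    re.IGNORECASE,
)

# Sample strings from the question (the second one keeps its trailing space).
samples = [
    "This is a sentence I would like to parse.",
    "This is too short. ",
    "Single word",
    "Not not not distinct distinct words words.",
    "Another sentence that I would be interested in.",
]

# Keep only the strings that satisfy all three conditions.
matches = [s for s in samples if PATTERN.search(s)]

for s in matches:
    print(s)
```

With these samples, only the first and fifth strings are kept, which is the result asked for in the question.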

Python's regex engine performs the following operations.

^            # match beginning of line
(?!          # begin negative lookahead
  .*         # match 0+ chars
  \b(\w+)\b  # match a word in cap grp 1
  .+         # match 1+ chars
  \b\1\b     # match the contents of cap grp 1 with word breaks
)            # end negative lookahead
(?:          # begin non-cap grp
  .*         # match 0+ chars
  \b\w+\b    # match a word
)            # end non-cap grp
{5}          # execute non-cap grp 5 times
.*           # match 0+ chars
[.?!]        # match a punctuation char
\s*          # match 0+ whitespaces
$            # match end of line
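
The same breakdown can be kept next to the pattern in code by compiling with re.VERBOSE, which ignores whitespace in the pattern and treats # as the start of a comment. This is only a sketch: the names COMPACT_PATTERN, VERBOSE_PATTERN, samples and results are made up for illustration, and the verbose version is intended to mirror the one-line pattern above rather than change it.

```python
import re

# One-line form of the pattern, for comparison.
COMPACT_PATTERN = re.compile(
    r"^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$",
    re.IGNORECASE,
)

# Same pattern written with re.VERBOSE so each part carries its own comment.
VERBOSE_PATTERN = re.compile(
    r"""
    ^                 # beginning of the string
    (?!               # negative lookahead: fail if any word repeats
      .*              #   0+ chars
      \b(\w+)\b       #   a whole word, saved in capture group 1
      .+              #   1+ chars
      \b\1\b          #   the same whole word again
    )
    (?:               # non-capturing group
      .*              #   0+ chars
      \b\w+\b         #   a whole word
    ){5}              # group repeated 5 times, so at least 5 words
    .*                # 0+ chars
    [.?!]             # a punctuation character
    \s*               # optional trailing whitespace
    $                 # end of the string
    """,
    re.IGNORECASE | re.VERBOSE,
)

samples = [
    "This is a sentence I would like to parse.",
    "This is too short. ",
    "Single word",
    "Not not not distinct distinct words words.",
    "Another sentence that I would be interested in.",
]

# True where a sample satisfies all three conditions.
results = [bool(VERBOSE_PATTERN.search(s)) for s in samples]
print(results)
```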
