mzbaran

Reputation: 624

Regex to find sentences of a minimum length

I am trying to create a regular expression that finds sentences with a minimum length.

Really my conditions are:

  1. there must be at least 5 words in the sequence
  2. the words in the sequence must be distinct
  3. the sequence must be followed by a punctuation character.

So far I have tried

^(\b\w*\b\s?){5,}\s?[.?!]$

If my sample text is:

This is a sentence I would like to parse.

This is too short. 

Single word

Not not not distinct distinct words words.

Another sentence that I would be interested in. 

I would like to match on strings 1 and 5.

I am using the Python re library. I am testing on regex101, and the regex above appears to do quite a lot of backtracking, so I imagine those knowledgeable in regex may be a bit appalled (my apologies).

Upvotes: 0

Views: 1581

Answers (2)

Cary Swoveland

Reputation: 110675

You can use the following regex to identify the strings that meet all three conditions:

^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$

with the case-insensitive flag (re.IGNORECASE) set.

Demo

Python's regex engine performs the following operations.

^            # match beginning of line
(?!          # begin negative lookahead
  .*         # match 0+ chars
  \b(\w+)\b  # match a word in cap grp 1
  .+         # match 1+ chars
  \b\1\b     # match the contents of cap grp 1 with word breaks
)            # end negative lookahead
(?:          # begin non-cap grp
  .*         # match 0+ chars
  \b\w+\b    # match a word
)            # end non-cap grp
{5}          # execute non-cap grp 5 times
.*           # match 0+ chars
[.?!]        # match a punctuation char
\s*          # match 0+ whitespaces
$            # match end of line
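
To try this against the sample text from the question, here is a minimal sketch using Python's re module. The names PATTERN, lines and matches are only illustrative, and each sample string is tested on its own, so no multiline flag is needed.

```python
import re

# Pattern copied from the answer above; re.IGNORECASE is the case-insensitive
# flag, so "Not" and "not" count as the same word in the lookahead.
PATTERN = re.compile(
    r"^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$",
    re.IGNORECASE,
)

# Sample strings from the question (two of them end with a trailing space).
lines = [
    "This is a sentence I would like to parse.",
    "This is too short. ",
    "Single word",
    "Not not not distinct distinct words words.",
    "Another sentence that I would be interested in. ",
]

# Keep only the strings that satisfy all three conditions.
matches = [s for s in lines if PATTERN.match(s)]

for s in matches:
    print(repr(s))
```

Only strings 1 and 5 should be kept, which is what the question asks for.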

Upvotes: 4

phramos07

Reputation: 156

Items 1. and 3. are easily done by regex, but

2. words in sequence must be distinct

I don't see how you could do it with a regex pattern. Remember that regex is a string-matching operation; it doesn't do heavy logic. This problem doesn't sound like a regex problem to me.

I recommend splitting the string on the space character " " and checking it word by word. Quicker, no sweat.
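
A rough sketch of that idea, under a few assumptions of my own: the function name is_valid_sentence is made up, the comparison is case-insensitive (so "Not" and "not" count as the same word), surrounding punctuation is stripped from each word, and str.split() is used so that any run of whitespace separates words, not just a single " ".

```python
import string


def is_valid_sentence(text, min_words=5):
    """Return True if text ends in . ? or ! and has at least min_words distinct words."""
    stripped = text.strip()
    # Condition 3: the sequence must be followed by a punctuation character.
    if not stripped or stripped[-1] not in ".?!":
        return False
    # Split on whitespace, drop surrounding punctuation, and lowercase each word.
    words = [w.strip(string.punctuation).lower() for w in stripped.split()]
    words = [w for w in words if w]
    # Condition 1: enough words. Condition 2: no word occurs twice.
    return len(words) >= min_words and len(set(words)) == len(words)


# Sample strings from the question.
lines = [
    "This is a sentence I would like to parse.",
    "This is too short. ",
    "Single word",
    "Not not not distinct distinct words words.",
    "Another sentence that I would be interested in. ",
]

results = [is_valid_sentence(s) for s in lines]
print(results)
```

This avoids regex backtracking entirely, since each string is scanned a fixed number of times.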

Edit

Condition 2 can be done with a negative lookahead, as Cary said.

Upvotes: 0
