NealR

Reputation: 10709

Regex to limit words with specific combination of letters (in any order)

This one is a little complicated and somewhat out of my league. I want to filter a list of words, keeping only those made up of a specific set of characters. The characters can appear in any order, and some may occur more often than others.

I want the regex to look for any words with:

e 0 or 1 times
a 0 or 1 times
t 0 or 1 or 2 times

For example, the following would work:

eat tea tate tt a e

The following would not work:

eats teas tates ttt aa ee

Lookaround Regex is new to me, so I'm not 100% sure on the syntax (any answer using a lookaround with an explanation would be awesome). My best guess so far:

Regex regex = new Regex(@"(?=.*e)(?=.*a)(?=.*t)");
lines = lines.Where(x => regex.IsMatch(x)).ToArray(); // 'lines' is the array containing the words
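
For reference, a quick check of how this attempt behaves, sketched in Python rather than C# (the lookahead syntax is the same in both). The word list is a hypothetical sample drawn from the examples above, and `re.search` is used because it has the same "match anywhere" semantics as `Regex.IsMatch`. Each positive lookahead requires its letter to be present at least once and puts no cap on the count, so "tt" is rejected while "eats" gets through:

```python
import re

# The attempted pattern: three positive lookaheads, each demanding
# that its letter occurs somewhere in the rest of the string.
attempt = re.compile(r"(?=.*e)(?=.*a)(?=.*t)")

# Hypothetical sample: two words that should pass, two that should not.
words = ["eat", "tt", "eats", "ttt"]

kept = [w for w in words if attempt.search(w)]
print(kept)  # ['eat', 'eats']
```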

Upvotes: 4

Views: 318

Answers (2)

user557597

This is probably the same as the other answers; I haven't formatted those to find out.

Note that assertions must succeed for the overall match to succeed; they can't be optional
(unless explicitly made optional, but what for?), and they are not directly affected by backtracking.

This works; the explanation is in the formatted regex.

Update: to use a whitespace boundary, use this:

(?<!\S)(?!\w*(?:e\w*){2})(?!\w*(?:a\w*){2})(?!\w*(?:t\w*){3})[eat]+(?!\S)

Formatted:

 (?<! \S )
 (?!
      \w* 
      (?: e \w* ){2}
 )
 (?!
      \w* 
      (?: a \w* ){2}
 )
 (?!
      \w* 
      (?: t \w* ){3}
 )
 [eat]+ 
 (?! \S )
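
A small sketch of how the whitespace-boundary version differs from the `\b` version given next in this answer. It is written in Python only as a convenient way to run the patterns (both are copied unchanged), and the sample string is made up for illustration. With whitespace boundaries a hyphenated token is rejected as a whole, whereas `\b` treats the hyphen as a word break and matches the pieces on either side:

```python
import re

# Whitespace-boundary version: the match must be surrounded by
# whitespace or by the start/end of the string.
ws_pattern = re.compile(
    r"(?<!\S)(?!\w*(?:e\w*){2})(?!\w*(?:a\w*){2})(?!\w*(?:t\w*){3})[eat]+(?!\S)"
)

# Word-boundary version: any non-word character ends a word.
wb_pattern = re.compile(
    r"\b(?!\w*(?:e\w*){2})(?!\w*(?:a\w*){2})(?!\w*(?:t\w*){3})[eat]+\b"
)

sample = "eat tea-t a"  # hypothetical input containing a hyphenated token

ws_hits = ws_pattern.findall(sample)
wb_hits = wb_pattern.findall(sample)
print(ws_hits)  # ['eat', 'a']
print(wb_hits)  # ['eat', 'tea', 't', 'a']
```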

To use an ordinary word boundary, use this:

\b(?!\w*(?:e\w*){2})(?!\w*(?:a\w*){2})(?!\w*(?:t\w*){3})[eat]+\b

Formatted:

 \b                     # Word boundary
 (?!                    # Lookahead: assert there are not 2 'e's
      \w* 
      (?: e \w* ){2}
 )
 (?!                    # Lookahead: assert there are not 2 'a's
      \w* 
      (?: a \w* ){2}
 )
 (?!                    # Lookahead: assert there are not 3 't's
      \w* 
      (?: t \w* ){3}
 )
 # At this point all the checks pass;
 # all that's left is to match the letters.
 # -------------------------------------------------

 [eat]+                 # 1 or more of these: consume letters 'e', 'a' or 't'
 \b                     # Word boundary
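
Since each letter gets one negative lookahead with a repeat count of "maximum allowed + 1", the pattern can be generated from a table of limits. The following is only a sketch in Python; `build_pattern` and its `max_counts` argument are hypothetical names, not part of any library. For the limits in the question it produces exactly the word-boundary regex above:

```python
import re

def build_pattern(max_counts):
    """Build a \\b-delimited regex that allows each letter at most
    max_counts[letter] times (hypothetical helper)."""
    # One negative lookahead per letter: fail if the word contains
    # (limit + 1) occurrences of that letter.
    lookaheads = "".join(
        r"(?!\w*(?:%s\w*){%d})" % (re.escape(ch), limit + 1)
        for ch, limit in max_counts.items()
    )
    # The consuming part only accepts the listed letters.
    letters = "".join(re.escape(ch) for ch in max_counts)
    return re.compile(r"\b" + lookaheads + "[" + letters + r"]+\b")

pattern = build_pattern({"e": 1, "a": 1, "t": 2})

good = ["eat", "tea", "tate", "tt", "a", "e"]
bad = ["eats", "teas", "tates", "ttt", "aa", "ee"]
accepted = [w for w in good + bad if pattern.fullmatch(w)]
print(accepted)  # ['eat', 'tea', 'tate', 'tt', 'a', 'e']
```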

Upvotes: 1

Tim Pietzcker

Reputation: 336468

Sure:

\b(?:e(?!\w*e)|t(?!(?:\w*t){2})|a(?!\w*a))+\b

Explanation:

\b             # Start of word
(?:            # Start of group: Either match...
 e             # an "e",
 (?!\w*e)      # unless another e follows within the same word,
|              # or
 t             # a "t",
 (?!           # unless...
  (?:\w*t){2}  # two more t's follow within the same word,
 )             # 
|              # or
 a             # an "a"
 (?!\w*a)      # unless another a follows within the same word.
)+             # Repeat as needed (at least one letter)
\b             # until we reach the end of the word.
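
As a quick sanity check, the commented form can be run more or less as written in a flavor that supports free-spacing mode. The sketch below uses Python's `re.VERBOSE` (an assumption about tooling; .NET has the equivalent `RegexOptions.IgnorePatternWhitespace`) and feeds it the two example lines from the question joined into one string:

```python
import re

# Same pattern as above, laid out in free-spacing mode.
pattern = re.compile(r"""
    \b                        # start of word
    (?:
        e (?!\w*e)            # an e, unless another e follows in this word
      | t (?!(?:\w*t){2})     # a t, unless two more t's follow in this word
      | a (?!\w*a)            # an a, unless another a follows in this word
    )+                        # at least one letter
    \b                        # end of word
""", re.VERBOSE)

# The "would work" and "would not work" lines from the question.
text = "eat tea tate tt a e eats teas tates ttt aa ee"

matches = pattern.findall(text)
print(matches)  # ['eat', 'tea', 'tate', 'tt', 'a', 'e']
```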

Test it live on regex101.com.

(I've used the \w character class for simplicity's sake; if you want to define your allowed "word characters" differently, replace it accordingly.)

Upvotes: 3
