bumbeishvili

Reputation: 1460

Regex - match words that consist only of certain characters, where some characters may be repeated a certain number of times

I have a word database containing 300,000+ words.

I want to match words whose length is known (7, for example) and which contain only certain characters. Some of those characters may be repeated a certain number of times, but not all of them.

For example, I have the chars a,p,p,l,e,r,t,h,o and I want to find words of length 5.

So it can match

apple
earth

but not

hello, because it uses l twice but l is specified only once

My attempts

 ^([a,p,p,l,e,r,t,h,o]{1}) # capture first char 

 (!/1 [a,p,p,l,e,r,t,h,o]{1}) # capture second char but without firstly captured symbol

 (!/1 !/2 [a,p,p,l,e,r,t,h,o]{1}) # capture third char but without first and second captured symbol

and so on  ...

Upvotes: 0

Views: 378

Answers (2)

Casperah

Reputation: 4554

I know that this is not a regex solution to the problem, but sometimes regex isn't the solution.

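using System.Collections.Generic;
using System.Linq;

// Checks whether a word can be built from a fixed multiset of letters.
public class WordChecker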
{
    public WordChecker(params char[] letters)
    {
        Counters = letters.GroupBy(c => c).ToDictionary(g => g.Key, g => new Counter(g.Count()));
    }
    public WordChecker(string letters) : this(letters.ToArray())
    {
    }

    public bool CheckWord(string word)
    {
        Initialize();
        foreach (var c in word)
        {
            Counter counter;
            if (!Counters.TryGetValue(c, out counter)) return false;
            if (!counter.Add()) return false;
        }
        return true;
    }

    private void Initialize()
    {
        foreach (var counter in Counters)
            counter.Value.Initialize();
    }
    private Dictionary<char, Counter> Counters;
    private class Counter
    {
        public Counter(int maxCount)
        {
            MaxCount = maxCount;
            Count = 0;
        }
        public void Initialize()
        {
            Count = 0;
        }
        public bool Add()
        {
            Count++;
            return Count <= MaxCount;
        }
        public int MaxCount { get; private set; }
        public int Count { get; private set; }
    }
}

And the way to use it is like this:

    WordChecker checker = new WordChecker("applertho");
    List<string> words = new List<string>(){"apple", "giraf", "earth", "hello"};
    foreach (var word in words)
        if (checker.CheckWord(word))
        {
            // The word is valid!
        }
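
For comparison, the same counting idea can be sketched in Python with collections.Counter. This is only an illustration and not part of the C# class above: the name can_build and the optional length parameter are assumptions, added here because the question also fixes the word length.

```python
from collections import Counter

def can_build(word, letters, length=None):
    """Return True if `word` uses only `letters`, each no more often than it is listed."""
    # Hypothetical extra check: the question also requires a fixed word length.
    if length is not None and len(word) != length:
        return False
    available = Counter(letters)
    used = Counter(word)
    # A missing letter has an available count of 0, so it fails this test too.
    return all(available[ch] >= n for ch, n in used.items())

words = ["apple", "giraf", "earth", "hello"]
matches = [w for w in words if can_build(w, "applertho", length=5)]
print(matches)  # ['apple', 'earth']
```

Building the Counter for the available letters once, outside the loop, would avoid repeating that work across 300,000+ words.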

Upvotes: 0

Valdi_Bo

Reputation: 30971

Try the following regex:

\b(?!\w*([alertho])\w*\1)(?!\w*([p])(\w*\2){2})[aplertho]{5}\b

Details:

  • \b - Word boundary (opening).
  • (?!\w*([alertho])\w*\1) - Negative lookahead, testing for more than 1 occurrence of any of the listed chars:
    • some word chars (optional),
    • one of the chars allowed to occur only once (capturing group #1),
    • some word chars (optional),
    • the same char as captured by group #1.
  • (?!\w*([p])(\w*\2){2}) - Negative lookahead, testing for more than 2 occurrences. Like before, but this time:
    • the capturing group is No. 2,
    • the set of allowed chars contains only p,
    • this lookahead "fires" if the char captured by group #2 occurs two more times thereafter.
  • [aplertho]{5} - What we are looking for - any of the allowed chars, 5 occurrences.
  • \b - Word boundary (closing).
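
A quick way to try the regex is a short Python script. The sample text is made up for illustration. The helper build_pattern is a hypothetical generalisation and not part of the regex above: instead of a backreference, it emits one negative lookahead per distinct letter, rejecting the word when that letter occurs more often than it is listed.

```python
import re
from collections import Counter

# The regex from the answer, verbatim.
ANSWER_PATTERN = r"\b(?!\w*([alertho])\w*\1)(?!\w*([p])(\w*\2){2})[aplertho]{5}\b"

def build_pattern(letters, length):
    """Hypothetical helper: build an equivalent regex for any multiset of letters."""
    counts = Counter(letters)
    lookaheads = ""
    for ch, n in sorted(counts.items()):
        # Reject the word if `ch` occurs n + 1 times or more.
        lookaheads += r"(?!(?:\w*" + re.escape(ch) + "){" + str(n + 1) + "})"
    allowed = "".join(re.escape(ch) for ch in sorted(counts))
    return r"\b" + lookaheads + "[" + allowed + "]{" + str(length) + r"}\b"

def find_words(pattern, text):
    # finditer + group(0) returns whole matches even when the pattern has groups.
    return [m.group(0) for m in re.finditer(pattern, text)]

text = "apple giraf earth hello paper"
print(find_words(ANSWER_PATTERN, text))                 # ['apple', 'earth', 'paper']
print(find_words(build_pattern("applertho", 5), text))  # ['apple', 'earth', 'paper']
```

Here paper matches because it uses p twice, which the letter set allows, while hello is rejected for its second l.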

Upvotes: 2
