vladakolic
vladakolic

Reputation: 290

Regex: Match specific characters in any order without more occurrences of each character than specified

I have a list of characters, e.g. {o, b, c, c, d, o, f}.

If a string contains characters that are not in that list, I don't want it to be a match. If a string contains more occurrences of a character than there are occurrences of that character in that list, I don't want it to be a match.

The characters in the string may occur in any order, and all characters don't have to appear. In the above example "foo" should be a match but not "fooo".

I have for instance narrowed the above example down to (o{0,2}b?c{0,2}d?f?), but that doesn't quite work since the order matters in that regex. I get a match for "oof" but not for "foo".

Upvotes: 12

Views: 11567

Answers (4)

There is an algorithm that allows converting an arbitrary "parallel" regex to a "usual one".

See the work Parallel Finite Automata for Modeling ConcurrentSoftware Systems that proves this equivalence as well as provides an algorithm for converting PFA to DFA, and the regex-like syntax for the interleaving expressions. In that syntax, your pattern is encoded as (o||o||b||c||c||d||f). An algorithm provided in the referenced paper constructs an NFA from any PFA, thus allowing one to rewrite the regex with interleaving as a "normal" regex, though it might be a bit longer than comfortable.

Upvotes: 0

Casimir et Hippolyte
Casimir et Hippolyte

Reputation: 89614

As gview says, regex is not the right tool. However, if your regex engine supports lookahead, you can use this:

^(?=(?:[^o]*o){0,2}[^o]*$)(?=(?:[^c]*c){0,2}[^c]*$)(?=[^b]*b?[^b]*$)(?=[^d]*d?[^d]*$)(?=[^f]*f?[^f]*$)[obcdf]+$

Its a bit long but very simple:

The string is matched with ^[obcdf]+$ (note the use of anchors).

The lookaheads (?=...) are only checks (followed by):

(?=(?:[^o]*o){0,2}[^o]*$)   # no more than 2 o until the end

(?=[^b]*b?[^b]*$) # no more than 1 b until the end

Each subpattern in lookaheads describes the whole string.

Upvotes: 11

user557597
user557597

Reputation:

Another way might work as well

 # ^(?!(?:.*o){3})(?!(?:.*c){3})(?!(?:.*b){2})(?!(?:.*d){2})(?!(?:.*f){2})[obcdf]+$

 ^                 # BOS
 (?! (?:.* o){3} ) # not more than 2 'o'
 (?! (?:.* c){3} ) # not more than 2 'c'
 (?! (?:.* b){2} ) # not more than 1 'b'
 (?! (?:.* d){2} ) # not more than 1 'd'
 (?! (?:.* f){2} ) # not more than 1 'f'
 [obcdf]+          # can only be these
 $                 # EOS

Upvotes: 5

gview
gview

Reputation: 15391

I don't think that regex is the right tool for that requirement. I would create a simple array with a count of the characters in the whitelist. If your language has associative arrays then key by the letter and have the number of occurrences in the array element.

Then process the word character by character, attempting a match in the associative array, and decrementing the available count.

It fails, if either:

  • you don't have a match for a letter in your array
  • you match but there isn't a count left for the matched letter.

Upvotes: 6

Related Questions