alanmanderson
alanmanderson

Reputation: 8210

php regex negative lookahead

I have a dictionary of 4 letter words. I want to write a regex to go through the dictionary and matches all words given a set of letters.

Suppose I pass in a,b,l,l. I want to find all words with exactly those letters.

I know I could do /[abl]{4}/ but that will also match words with 2 a's or 2 b's.

I feel like I need to do a negative look ahead. Something like:

[l|(ab)(?!\1)]{4} 

The attempt here is that I want a word that starts with l or a or b and not followed by a or b.

Upvotes: 1

Views: 384

Answers (3)

user557597
user557597

Reputation:

Edit: So for 47 letters it would be

\b(?:((?(1)(?!))l1)|((?(2)(?!))l2)|...|((?(47)(?!))l47)){47}\b

Letters can be duplicates, say 4 a's and 15 r's (but no more), etc ...
( immune to permutations )


To match out of order items only once,
use a conditional to allow each item to match once,
but no more.

It's not complicated, and is immune to permutations.

Works every time !

\b(?:((?(1)(?!))a)|((?(2)(?!))b)|((?(3)(?!))l)|((?(4)(?!))l)){4}\b

Expanded

 \b 
 (?:
      (                             # (1)
           (?(1)(?!))
           a 
      )

   |  
      (                             # (2)
           (?(2)(?!))
           b 
      )
   |  
      (                             # (3)
           (?(3)(?!))
           l 
      )
   |  
      (                             # (4)
           (?(4)(?!))
           l 
      )
 ){4}
 \b  

Upvotes: 0

Casimir et Hippolyte
Casimir et Hippolyte

Reputation: 89566

First thing you need to anchor your pattern to describe where the string begins and ends:

for a whole string (^ start of the string, $ end of the string):

^[abl]{4}$

or to find words in a larger text, use word-boundaries (limit between a character from [A-Za-z0-9_] and something else):

\b[abl]{4}\b

Then you need to say that l must occur two times (or that a and b must occurs only one time, but it's more complicated):

for a whole string:

^(?=.*l.*l)[abl]{4}$ 

in a larger text:

\b(?=\w*l\w*l)[abl]{4}\b

To avoid two a or b, you can use an other lookahead:

for a whole string:

^(?=.*l.*l)(?=l*al*b|l*bl*a)[abl]{4}$ 

in a larger text:

\b(?=\w*l\w*l)(?=l*al*b|l*bl*a)[abl]{4}\b

About [l|(ab)(?!\1)]: in a character class, special regex characters or sequence of characters loose their special meaning and all characters are seen as literals. So [l|(ab)(?!\1)] is the same than [)(!|?1abl] for example. (Since \1 is an unknown escape sequence in a character class, the backslash is ignored.)

Note that with several constraints the pattern becomes quickly ugly. You should consider an other approach that consists to catch all words with \b[abl]{4}\b and to filter them in a second time (using count_chars for example).

$str ='abll labl ball aabl lblabla 1234';

$dict = 'abll';
$count = count_chars($dict);

$result = [];
if (preg_match_all('~\b[abl]{4}\b~', $str, $matches)) {
    $result = array_filter($matches[0], function ($i) use ($count) {
        return $count == count_chars($i);
    });
}

print_r($result);

Upvotes: 2

Alexander Borisov
Alexander Borisov

Reputation: 401

If you want specify letters dynamically and then generate regexp that will do all work - this will be a very expensive work.

Simple approach: you can generate simple regexp like /^[abl]{4}$/, get all words from dictionary that match him and then validate each word separately - check letters quantity.

More efficient approach: you can index your words in dictionary with sorted list of letters like this:

word: apple | index: aelpp

word: pale | index: aelp

And so on. To get all words from list of letters you simply should sort this letters and find exact match with "index" value.

Upvotes: 0

Related Questions