Reputation: 8210
I have a dictionary of 4 letter words. I want to write a regex to go through the dictionary and matches all words given a set of letters.
Suppose I pass in a,b,l,l
. I want to find all words with exactly those letters.
I know I could do /[abl]{4}/
but that will also match words with 2 a's or 2 b's.
I feel like I need to do a negative look ahead. Something like:
[l|(ab)(?!\1)]{4}
The attempt here is that I want a word that starts with l or a or b and not followed by a or b.
Upvotes: 1
Views: 384
Reputation:
Edit: So for 47 letters it would be
\b(?:((?(1)(?!))l1)|((?(2)(?!))l2)|...|((?(47)(?!))l47)){47}\b
Letters can be duplicates, say 4 a's and 15 r's (but no more), etc ...
( immune to permutations )
To match out of order items only once,
use a conditional to allow each item to match once,
but no more.
It's not complicated, and is immune to permutations.
Works every time !
\b(?:((?(1)(?!))a)|((?(2)(?!))b)|((?(3)(?!))l)|((?(4)(?!))l)){4}\b
Expanded
\b
(?:
( # (1)
(?(1)(?!))
a
)
|
( # (2)
(?(2)(?!))
b
)
|
( # (3)
(?(3)(?!))
l
)
|
( # (4)
(?(4)(?!))
l
)
){4}
\b
Upvotes: 0
Reputation: 89566
First thing you need to anchor your pattern to describe where the string begins and ends:
for a whole string (^
start of the string, $
end of the string):
^[abl]{4}$
or to find words in a larger text, use word-boundaries (limit between a character from [A-Za-z0-9_]
and something else):
\b[abl]{4}\b
Then you need to say that l
must occur two times (or that a
and b
must occurs only one time, but it's more complicated):
for a whole string:
^(?=.*l.*l)[abl]{4}$
in a larger text:
\b(?=\w*l\w*l)[abl]{4}\b
To avoid two a or b, you can use an other lookahead:
for a whole string:
^(?=.*l.*l)(?=l*al*b|l*bl*a)[abl]{4}$
in a larger text:
\b(?=\w*l\w*l)(?=l*al*b|l*bl*a)[abl]{4}\b
About [l|(ab)(?!\1)]
: in a character class, special regex characters or sequence of characters loose their special meaning and all characters are seen as literals. So [l|(ab)(?!\1)]
is the same than [)(!|?1abl]
for example. (Since \1
is an unknown escape sequence in a character class, the backslash is ignored.)
Note that with several constraints the pattern becomes quickly ugly. You should consider an other approach that consists to catch all words with \b[abl]{4}\b
and to filter them in a second time (using count_chars
for example).
$str ='abll labl ball aabl lblabla 1234';
$dict = 'abll';
$count = count_chars($dict);
$result = [];
if (preg_match_all('~\b[abl]{4}\b~', $str, $matches)) {
$result = array_filter($matches[0], function ($i) use ($count) {
return $count == count_chars($i);
});
}
print_r($result);
Upvotes: 2
Reputation: 401
If you want specify letters dynamically and then generate regexp that will do all work - this will be a very expensive work.
Simple approach: you can generate simple regexp like /^[abl]{4}$/
, get all words from dictionary that match him and then validate each word separately - check letters quantity.
More efficient approach: you can index your words in dictionary with sorted list of letters like this:
word: apple | index: aelpp
word: pale | index: aelp
And so on. To get all words from list of letters you simply should sort this letters and find exact match with "index" value.
Upvotes: 0