Reputation: 63
I have the following string:
SEDCVBNT S800BG09 7GFHFGD6H 324235346 RHGF7U S8-00BG/09 7687678
and the following regex:
preg_match_all('/\b(?=.+[0-9])(?=.+[A-Z])[A-Z0-9-\/]{4,20}/i', $string, $matches)
What I'm trying to achieve is to return all of the whole "words" that contain both letters and digits, and that may optionally contain / and -.
Unfortunately, the above regex returns purely alphabetical and purely numeric words as well:
Array (
[0] => Array (
[0] => SEDCVBNT
[1] => S800BG09
[2] => 7GFHFGD6H
[3] => 324235346
[4] => RHGF7U
[5] => S8-00BG/09
)
)
I don't want SEDCVBNT or 324235346 to be returned.
Upvotes: 2
Views: 185
Reputation: 47874
Word boundary markers (\b) cannot be relied upon to identify the edges of a "word" for this task. For example, a word that ends in a slash and is followed by a space has no word boundary after the slash. A word boundary is only appropriate for finding the zero-width position between a \w and a \W character (in either order).
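A minimal sketch of that failure case, using a made-up token (AB1/ is only an illustration, not part of the question's data). It compares \b with whitespace-based lookarounds on a token that ends in a slash:

```php
<?php
// Hypothetical sample: the token "AB1/" ends in a slash and is followed by a space.
$sample = 'AB1/ next';

// \b needs a \w character on exactly one side of the position.
// Between "/" and " " both neighbours are \W, so the trailing \b cannot match.
$withWordBoundary = preg_match('~\bAB1/\b~', $sample);

// Whitespace lookarounds only check that no non-space character is adjacent.
$withLookarounds = preg_match('~(?<!\S)AB1/(?!\S)~', $sample);

var_dump($withWordBoundary); // int(0)
var_dump($withLookarounds);  // int(1)
```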
Code:
$string = 'SEDCVBNT S800BG09 7GFHFGD6H 324235346 RHGF7U S8-00BG/09 7687678';
preg_match_all(
    '~
    (?:^|\s)       #match start of string or whitespace
    \K             #release previously matched characters
    (?=\S*[a-z])   #lookahead for zero or more visible characters followed by letter
    (?=\S*\d)      #lookahead for zero or more visible characters followed by number
    [a-z\d/-]+     #match one or more consecutive whitelisted characters
    (?=\s|$)       #lookahead for a whitespace or the end of string
    ~xi',          #ignore literal whitespaces in pattern, use case-insensitivity with letters
    $string,
    $m
);
var_export($m);
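The pattern in the question also limited tokens to 4 to 20 characters, which the + quantifier does not enforce. If that limit matters, one possible variation (a sketch, not a tested drop-in) swaps + for {4,20} and uses (?<!\S) and (?!\S) as the token edges. The trailing token A1 is a hypothetical addition to the sample, included only to show the length check:

```php
<?php
// Sample from the question, plus a hypothetical short token "A1".
$string = 'SEDCVBNT S800BG09 7GFHFGD6H 324235346 RHGF7U S8-00BG/09 7687678 A1';

preg_match_all(
    '~(?<!\S)(?=\S*[a-z])(?=\S*\d)[a-z\d/-]{4,20}(?!\S)~i',
    $string,
    $m
);

// Tokens that mix letters and digits and are 4 to 20 characters long.
print_r($m[0]); // S800BG09, 7GFHFGD6H, RHGF7U, S8-00BG/09
```

Here (?<!\S) and (?!\S) mean "not preceded by" and "not followed by" a non-whitespace character, so they also succeed at the very start and end of the string.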
Upvotes: 0
Reputation: 437336
You need slightly advanced regex syntax for this one.
The regex I came up with is
(?<=\s|^)(?=[\w/-]*\d[\w/-]*)(?=[\w/-]*[A-Za-z][\w/-]*)([\w/-])+(?=\s|$)
Let's explain it:
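[\w/-] comes up a lot; this means "any word character (letters, digits and the underscore) or a slash or a dash": effectively, all characters that you consider to be part of a valid token.
(?=[\w/-]*\d[\w/-]*) and (?=[\w/-]*[A-Za-z][\w/-]*) are positive lookaheads that check, without consuming anything, that the token contains a digit and a letter respectively.
I'm using a positive lookahead at the end, (?=\s|$), and a positive lookbehind at the beginning, (?<=\s|^), to make sure that a match is only made if the whole text token begins after a whitespace character or at the beginning of the input string (\s|^) and is followed by a whitespace character or terminates the input string (\s|$).
There are three groups between those two assertions: the two lookaheads and the capturing group ([\w/-])+. In effect I'm using them to only match text that matches multiple patterns: both of the lookaheads and the capture group pattern at the end.
The first lookahead matches a token that contains a digit (\d).
The second lookahead matches a token that contains a letter (A-Za-z).
The capture group matches a token that consists only of word characters, / and -.
Therefore, for the capture group to match, the text being examined must:
contain a digit (first lookahead),
contain a letter (second lookahead), and
consist only of word characters, / and - (capturing group).
Which is exactly what you require. :)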
Note: refiddle.com seems to not play well with lookbehind, so the regexp after the link does not include the initial (?<=\s|^) part. This means that it will erroneously match the DEF456 in ABC123$DEF456.
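As a rough check in PHP (a sketch that assumes the pattern is wrapped in ~ delimiters, which the answer does not specify), the full regex including the lookbehind can be run against the question's string and against ABC123$DEF456:

```php
<?php
// The regex from this answer, wrapped in ~ delimiters (an assumption).
$pattern = '~(?<=\s|^)(?=[\w/-]*\d[\w/-]*)(?=[\w/-]*[A-Za-z][\w/-]*)([\w/-])+(?=\s|$)~';

$string = 'SEDCVBNT S800BG09 7GFHFGD6H 324235346 RHGF7U S8-00BG/09 7687678';
preg_match_all($pattern, $string, $matches);

// Index 0 holds the whole tokens.
print_r($matches[0]); // S800BG09, 7GFHFGD6H, RHGF7U, S8-00BG/09

// The group is repeated, so index 1 keeps only the last character it captured.
print_r($matches[1]); // 9, H, U, 9

// With the lookbehind in place, DEF456 is not matched: it follows "$", not whitespace.
$count = preg_match_all($pattern, 'ABC123$DEF456', $none);
var_dump($count); // int(0)
```

Because the quantifier sits outside the parentheses, read the tokens from $matches[0]; writing ([\w/-]+) instead would put the whole token in group 1 as well.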
Upvotes: 2
Reputation: 27913
Here is the raw regex: \b(?=\S*?\d)(?=\S*?[a-z])\S+?(?=$|\s)
preg_match_all('/\b(?=\S*?\d)(?=\S*?[a-z])\S+?(?=$|\s)/i', $string, $matches)
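A quick sketch of how this pattern behaves. On the question's string it returns the four mixed tokens, but because \S accepts any non-whitespace character it does not restrict tokens to letters, digits, / and -. The token AB$12 below is a hypothetical input, not taken from the question:

```php
<?php
$pattern = '/\b(?=\S*?\d)(?=\S*?[a-z])\S+?(?=$|\s)/i';

$string = 'SEDCVBNT S800BG09 7GFHFGD6H 324235346 RHGF7U S8-00BG/09 7687678';
preg_match_all($pattern, $string, $matches);
print_r($matches[0]); // S800BG09, 7GFHFGD6H, RHGF7U, S8-00BG/09

// Hypothetical token containing a character outside the allowed set.
preg_match_all($pattern, 'AB$12', $extra);
print_r($extra[0]); // AB$12 is accepted, since \S+? allows "$"
```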
Upvotes: -1