code veda

Reputation: 125

preg_match() function in PHP returns improper result

$sRangeRegex = '/^(.{0,30})?$/';
$value='12345678901234567890123456789ä';
if (!preg_match($sRangeRegex, $value)) {
    alert('not match');
}

When I run this code, it shows the 'not match' alert message, but it shouldn't. The value contains 30 characters, yet it is counted as 31. The umlaut characters are causing the problem while matching. I want a solution that uses regex only. Thanks.

Upvotes: 1

Views: 99

Answers (2)

Wiktor Stribiżew

Reputation: 627607

It is already common knowledge here on SO that, to work with Unicode strings, the PHP regex engine needs a pattern with the /u flag. It is a less well-known fact that, to match a Unicode grapheme (a base character together with any combining marks), one needs to use the \X shorthand class (PCRE-compliant).

So, to apply a length restriction to a Unicode string, use \X instead of . in the pattern:

$pattern = '/^\X{0,30}$/u';

Note that this regex matches strings that contain 0 to 30 Unicode graphemes. You do not need any (...)? optional group, since the 0 lower bound of the limiting quantifier already allows an empty string.

However, to check the length of the Unicode string in code points rather than bytes, you need to use mb_strlen. See this post of mine for an example.
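The three ways of counting mentioned so far (bytes, code points, graphemes) can be compared side by side. This is only a minimal sketch: the helper names byteLength, codePointLength and graphemeLength are made up for illustration, the strings are written with \u{...} escapes so the encoding is explicit, and mb_strlen assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helpers; only strlen, mb_strlen and preg_match_all are real PHP functions.

// Length in bytes: what '.' effectively counts when the pattern has no /u flag.
function byteLength(string $s): int
{
    return strlen($s);
}

// Length in Unicode code points (requires the mbstring extension).
function codePointLength(string $s): int
{
    return mb_strlen($s, 'UTF-8');
}

// Length in graphemes: count how many times \X matches in UTF-8 mode.
function graphemeLength(string $s): int
{
    return (int) preg_match_all('/\X/u', $s);
}

$precomposed = "12345678901234567890123456789\u{00E4}";  // 29 digits + ä as one code point
$decomposed  = "12345678901234567890123456789a\u{0308}"; // 29 digits + a + combining diaeresis

printf("%d %d %d\n", byteLength($precomposed), codePointLength($precomposed), graphemeLength($precomposed)); // 31 30 30
printf("%d %d %d\n", byteLength($decomposed), codePointLength($decomposed), graphemeLength($decomposed));   // 32 31 30
```

Only the grapheme count stays at 30 for both spellings of the umlaut, which is why \X is the safer choice for a "visible characters" limit.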

See this demo:

$pattern = '/^.{0,30}$/u';
$value='12345678901234567890123456789Å';
if (!preg_match($pattern, $value)) {
    echo "not match\n";
}
else echo "match!\n";

$pattern = '/^\X{0,30}$/u';
$value='12345678901234567890123456789Å';
if (!preg_match($pattern, $value)) {
    echo 'not match';
}
else echo "match!";

Results:

not match (this is the regex with a dot)
match!    (the regex based on \X)
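The dot-based pattern can only fail in /u mode if the string holds more than 30 code points, so the results above presumably rely on Å being stored in decomposed form (A followed by U+030A, the combining ring above). That is an assumption about the demo input, not something the post states. A sketch that makes both forms explicit, with hypothetical helper names fitsByDot and fitsByGrapheme:

```php
<?php
// Assumption: the demo's Å is "A" + U+030A; both spellings are built explicitly here.
$prefix     = str_repeat('1234567890', 2) . '123456789'; // 29 ASCII digits
$composed   = $prefix . "\u{00C5}";  // Å as a single code point (30 code points in total)
$decomposed = $prefix . "A\u{030A}"; // A + combining ring above (31 code points in total)

// Hypothetical helper: true if the string has at most 30 code points.
function fitsByDot(string $s): bool
{
    return preg_match('/^.{0,30}$/u', $s) === 1;
}

// Hypothetical helper: true if the string has at most 30 graphemes.
function fitsByGrapheme(string $s): bool
{
    return preg_match('/^\X{0,30}$/u', $s) === 1;
}

var_dump(fitsByDot($composed));        // bool(true)
var_dump(fitsByDot($decomposed));      // bool(false)
var_dump(fitsByGrapheme($composed));   // bool(true)
var_dump(fitsByGrapheme($decomposed)); // bool(true)
```

If the input is known to be in precomposed form, the dot with /u is enough; \X covers both cases.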

Upvotes: 3

arkascha

Reputation: 42984

You need to tell your regex engine to work in UTF-8 mode by adding the u modifier to the pattern:

<?php
$pattern = '/^(.{0,30})?$/u';
$subject='12345678901234567890123456789ä';

if (!preg_match($pattern, $subject, $tokens)) {
    alert('not match');
}
var_dump($tokens);

Note the trailing u after the closing delimiter of the pattern.

The output is:

array(2) {
  [0] =>
  string(31) "12345678901234567890123456789ä"
  [1] =>
  string(31) "12345678901234567890123456789ä"
}
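The string(31) in that dump is the byte length reported by var_dump, not the number of characters the pattern consumed. As a rough sketch of the difference the modifier makes, assuming the ä is the precomposed code point U+00E4 (written as an escape here so the encoding is unambiguous):

```php
<?php
$subject = "12345678901234567890123456789\u{00E4}"; // 29 digits + precomposed ä

// Without u, the dot consumes one byte at a time, and the subject is 31 bytes long.
$withoutU = preg_match('/^(.{0,30})?$/', $subject);

// With u, the dot consumes one code point at a time, and the subject has 30 of them.
$withU = preg_match('/^(.{0,30})?$/u', $subject);

var_dump(strlen($subject)); // int(31)
var_dump($withoutU);        // int(0)
var_dump($withU);           // int(1)
```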

Upvotes: 0
