Reputation: 51678
I'm trying to search a UTF8-encoded string using preg_match.
preg_match('/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
echo $a_matches[0][1];
This should print 1, since "H" is at index 1 in the string "¡Hola!". But it prints 2. So it seems like it's not treating the subject as a UTF8-encoded string, even though I'm passing the "u" modifier in the regular expression.
I have the following settings in my php.ini, and other UTF8 functions are working:
mbstring.func_overload = 7
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.http_input = pass
mbstring.http_output = pass
mbstring.encoding_translation = Off
Any ideas?
Upvotes: 48
Views: 79163
Reputation: 2441
The code below works as a replacement for both preg_match and preg_match_all, and returns the matches with correct character offsets for UTF-8-encoded strings.
mb_internal_encoding('UTF-8');
/**
* Returns array of matches in same format as preg_match or preg_match_all
* @param bool $matchAll If true, execute preg_match_all, otherwise preg_match
* @param string $pattern The pattern to search for, as a string.
* @param string $subject The input string.
* @param int $offset The place from which to start the search (in bytes).
* @return array
*/
function pregMatchCapture($matchAll, $pattern, $subject, $offset = 0)
{
    $matchInfo = array();
    $method = 'preg_match';
    $flag = PREG_OFFSET_CAPTURE;
    if ($matchAll) {
        $method .= '_all';
    }
    $n = $method($pattern, $subject, $matchInfo, $flag, $offset);
    $result = array();
    if ($n !== 0 && !empty($matchInfo)) {
        if (!$matchAll) {
            // Wrap the single match so both cases share the same loop.
            $matchInfo = array($matchInfo);
        }
        foreach ($matchInfo as $matches) {
            $positions = array();
            foreach ($matches as $match) {
                $matchedText = $match[0];
                $byteOffset = $match[1]; // offset in bytes, as reported by PCRE
                $positions[] = array(
                    $matchedText,
                    // Count the characters in the prefix that ends at the byte offset.
                    mb_strlen(mb_strcut($subject, 0, $byteOffset))
                );
            }
            $result[] = $positions;
        }
        if (!$matchAll) {
            $result = $result[0];
        }
    }
    return $result;
}
$s1 = 'Попробуем русскую строку для теста';
$s2 = 'Try english string for test';
var_dump(pregMatchCapture(true, '/обу/', $s1));
var_dump(pregMatchCapture(false, '/обу/', $s1));
var_dump(pregMatchCapture(true, '/lish/', $s2));
var_dump(pregMatchCapture(false, '/lish/', $s2));
Output of my example:
array(1) {
[0]=>
array(1) {
[0]=>
array(2) {
[0]=>
string(6) "обу"
[1]=>
int(4)
}
}
}
array(1) {
[0]=>
array(2) {
[0]=>
string(6) "обу"
[1]=>
int(4)
}
}
array(1) {
[0]=>
array(1) {
[0]=>
array(2) {
[0]=>
string(4) "lish"
[1]=>
int(7)
}
}
}
array(1) {
[0]=>
array(2) {
[0]=>
string(4) "lish"
[1]=>
int(7)
}
}
Upvotes: 9
Reputation: 48091
I think working with PREG_OFFSET_CAPTURE in this case only creates more work.
Demo of below scripts.
If the pattern only contains literal characters, then preg_ is overkill; just use mb_strpos() and bear in mind that the returned value will be false if the needle is not found in the haystack.
var_export(mb_strpos($str, 'H')); // 1
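A minimal sketch of why the false return value needs care: a match at position 0 is also falsy, so the result should be compared with ===. The variable names are only illustrative, and the encoding is passed explicitly instead of relying on the internal encoding.

```php
<?php
$str = "\xC2\xA1Hola!";

$found   = mb_strpos($str, 'H', 0, 'UTF-8');        // needle present after the first character
$missing = mb_strpos($str, 'Z', 0, 'UTF-8');        // needle absent
$atStart = mb_strpos($str, "\xC2\xA1", 0, 'UTF-8'); // needle is the very first character

// A strict comparison tells "not found" (false) apart from "found at position 0".
foreach (['found' => $found, 'missing' => $missing, 'atStart' => $atStart] as $label => $pos) {
    echo $label . ': ' . ($pos === false ? 'not found' : "position $pos") . "\n";
}
```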
If you know that the needle will exist in the haystack, you can use preg_match_all() with the marvellous \G (continue) metacharacter and the \X metacharacter (any Unicode character, matched as an extended grapheme cluster).
echo preg_match_all('/\G(?![A-Z])\X/u', $str); // 1
// if needle not found, will return the mb length of haystack
If you don't know if the needle will exist in the haystack, just check if the returned count is equal to the multibyte length of the input string.
$mbLength = preg_match_all('/\G(?![A-Z])\X/u', $str, $m);
var_export(mb_strlen($str) !== $mbLength ? $mbLength : 'not found');
But if you are going to call an extra mb_ function anyhow, then make just one match, check if a match was made, and measure its multibyte length if so.
var_export(
preg_match('/\X*?(?=[A-Z])/u', $str, $m) ? mb_strlen($m[0]) : 'not found'
);
All this said, I've never seen the need to count the multibyte position of something unless the greater task was to isolate or replace a substring. If this is the case, avoid this step entirely and just use preg_match() or preg_replace() to more directly serve your needs.
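As a rough illustration of that last point, here is a sketch that replaces or isolates the substring directly, without computing any position. The sample string is the one from the question; the patterns and variable names are assumptions made for the example.

```php
<?php
$str = "\xC2\xA1Hola!";

// Replace the first uppercase ASCII letter; no offset is needed.
$replaced = preg_replace('/[A-Z]/u', 'h', $str, 1);

// Isolate everything from the first uppercase ASCII letter onwards.
$tail = preg_match('/[A-Z]\X*/u', $str, $m) ? $m[0] : null;

echo $replaced . "\n";
echo $tail . "\n";
```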
Upvotes: 0
Reputation: 422
For me, the problem was solved simply by using plain substr instead of the expected mb_substr (PHP 7.4).
Using mb_substr together with preg_match_all and PREG_OFFSET_CAPTURE (with or without the /u modifier) gave incorrect positions when the text contained the euro sign (€).
iconv and utf8_encode did not help either, and I was not able to use htmlentities.
Reverting to plain substr fixed it, and it worked correctly with € and other characters.
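A small sketch of why this works: the offsets from PREG_OFFSET_CAPTURE are byte offsets, so they pair correctly with the byte-based substr and strlen, but not with mb_substr. The sample text and variable names are made up for the example.

```php
<?php
$text = "Price: 10 \u{20AC} or 12 \u{20AC}"; // the euro sign takes three bytes in UTF-8
preg_match_all('/\d+ \x{20AC}/u', $text, $matches, PREG_OFFSET_CAPTURE);

$slices = [];
foreach ($matches[0] as [$matched, $byteOffset]) {
    // Byte offset and byte length go together with the byte-based substr().
    $slices[] = substr($text, $byteOffset, strlen($matched));
}

// Mixing the byte offset of the second match with mb_substr() lands in the wrong place.
$mixed = mb_substr($text, $matches[0][1][1], 4, 'UTF-8');

var_dump($slices, $mixed);
```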
Upvotes: 0
Reputation: 21278
You can calculate the real UTF-8 offset by cutting the string at the offset returned by preg_match with the byte-counting substr, and then measuring this prefix with the character-counting mb_strlen.
$utf8Offset = mb_strlen(substr($text, 0, $offsetFromPregMatch), 'UTF-8');
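The same idea can be wrapped in a small function. This is only a sketch: byteToCharOffset is a hypothetical helper name, the sample text is invented, and it assumes substr is the plain byte-based function (no mbstring.func_overload).

```php
<?php
// Hypothetical helper: converts a byte offset into a character offset.
function byteToCharOffset(string $text, int $byteOffset, string $encoding = 'UTF-8'): int
{
    // Cut by bytes, then count the characters in that prefix.
    return mb_strlen(substr($text, 0, $byteOffset), $encoding);
}

$text = "5 \u{20AC} for tea"; // the euro sign takes three bytes in UTF-8
preg_match('/for/u', $text, $m, PREG_OFFSET_CAPTURE);

$byteOffset = $m[0][1];
$charOffset = byteToCharOffset($text, $byteOffset);

echo "byte offset: $byteOffset, character offset: $charOffset\n";
```

The character offset can then be passed to mb_substr and the other mb_ functions, while the byte offset stays valid for substr.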
Upvotes: 3
Reputation: 297
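Try adding (*UTF8) at the very start of the regex, inside the delimiters:
preg_match('/(*UTF8)H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
This turns on UTF-8 matching from inside the pattern, much as the u modifier does; the offsets reported by PREG_OFFSET_CAPTURE are still counted in bytes.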
Magic, thanks to a comment in https://www.php.net/manual/function.preg-match.php#95828
Upvotes: 28
Reputation: 2983
You might want to look at T-Regx library.
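pattern('/Hola/u')->match("\xC2\xA1Hola!")->first(function (Match $match)
{
    echo $match->offset(); // characters
    echo $match->byteOffset(); // bytes
});
Here $match->offset() is the UTF-8-safe offset, counted in characters. Note that the subject must be in double quotes for the \x escapes to be interpreted.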
Upvotes: 1
Reputation: 34
I wrote a small class to convert the byte offsets returned by preg_match to proper UTF-8 character offsets:
final class NonUtfToUtfOffset
{
    /** @var int[] Maps each byte offset to the offset of the character containing it. */
    private $utfMap = [];

    public function __construct(string $content)
    {
        $contentLength = mb_strlen($content);
        for ($offset = 0; $offset < $contentLength; $offset++) {
            $char = mb_substr($content, $offset, 1);
            $nonUtfLength = strlen($char);
            for ($charOffset = 0; $charOffset < $nonUtfLength; $charOffset++) {
                $this->utfMap[] = $offset;
            }
        }
    }

    public function convertOffset(int $nonUtfOffset): int
    {
        return $this->utfMap[$nonUtfOffset];
    }
}
You can use it like this:
$content = 'aą bać d';
$offsetConverter = new NonUtfToUtfOffset($content);
preg_match_all('#(bać)#ui', $content, $m, PREG_OFFSET_CAPTURE);
foreach ($m[1] as [$word, $offset]) {
    echo "bad: " . mb_substr($content, $offset, mb_strlen($word)) . "\n";
    echo "good: " . mb_substr($content, $offsetConverter->convertOffset($offset), mb_strlen($word)) . "\n";
}
Upvotes: 1
Reputation: 53960
Looks like this is a "feature", see http://bugs.php.net/bug.php?id=37391
The 'u' switch only matters to PCRE; PHP itself is unaware of it.
From PHP's point of view, strings are byte sequences, so returning a byte offset seems logical (I don't say "correct").
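A short sketch of what "byte sequence" means for the string in the question. It assumes strlen is the plain byte-counting function (no mbstring.func_overload); the variable names are only illustrative.

```php
<?php
// "¡" (U+00A1) is encoded in UTF-8 as the two bytes C2 A1.
$str = "\xC2\xA1Hola!";

$byteLength = strlen($str);             // counts bytes
$charLength = mb_strlen($str, 'UTF-8'); // counts characters

preg_match('/H/u', $str, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1];                 // offset as reported by preg_match

// The reported offset indexes into the byte sequence, so it points at "H" there.
$byteAtOffset = $str[$byteOffset];

echo "$byteLength bytes, $charLength characters, offset $byteOffset, byte '$byteAtOffset'\n";
```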
Upvotes: 24
Reputation: 655825
Although the u modifier makes both the pattern and subject be interpreted as UTF-8, the captured offsets are still counted in bytes.
You can use mb_strlen to get the length in UTF-8 characters rather than bytes:
$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
echo mb_strlen(substr($str, 0, $a_matches[0][1]));
Upvotes: 52
Reputation: 6516
If all you want to do is find the multibyte-safe position of H, try mb_strpos():
mb_internal_encoding('UTF-8');
$str = "\xC2\xA1Hola!";
$pos = mb_strpos($str, 'H');
echo $str."\n";
echo $pos."\n";
echo mb_substr($str,$pos,1)."\n";
Output:
¡Hola!
1
H
Upvotes: 1