Tim Jones

Reputation: 51

Different regex output on 2 PHP systems?

Given this test script:

<?php

echo setlocale(LC_ALL, '') . "\n";

$in = 'Città';

$var = preg_replace('/\s+$/', '', $in);

echo bin2hex($in) . "\n";
echo bin2hex($var) . "\n";

On PHP 5.5.3 on Ubuntu, I get:

en_GB.UTF-8
43697474c3a0
43697474c3a0

On PHP 5.5.9 on Mac (via MacPorts), I get:

en_GB.UTF-8
43697474c3a0
43697474c3

Is there any reason why the MacPorts build would treat the à character differently?

I'm aware that c3a0, when treated as two single-byte characters (ISO-8859-1 rather than UTF-8), is Ã followed by a non-breaking space. I am wondering why one system appears to treat the 2 bytes as a single UTF-8 character even without the u modifier.

Upvotes: 5

Views: 123

Answers (1)

Piskvor left the building

Reputation: 92792

Use the /u modifier:

u (PCRE_UTF8) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8.

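Without /u, the subject string is treated as a sequence of single-byte characters, but à is encoded in UTF-8 as two bytes, c3 a0. Although those two bytes represent a single code point (U+00E0), \s is tested against each byte separately. The second byte, 0xa0, is a non-breaking space in ISO-8859-1, so a PCRE build whose character tables classify 0xa0 as whitespace will match it and strip it. Whether 0xa0 counts as whitespace in byte mode can depend on the PCRE build and the active locale, which would explain the difference between your two systems.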
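To make the byte-level picture concrete, here is a small sketch that inspects the string without relying on \s at all. The variable names are only illustrative, and the string is written with escapes so the encoding of the source file does not matter. It shows that the last byte is 0xa0 and that removing it leaves a string that is no longer valid UTF-8, which is what the Mac output in the question contains.

```php
<?php
// "Città" written with explicit escapes so the source file's encoding is irrelevant.
$in = "Citt\xc3\xa0";

// strlen() counts bytes, not characters: 4 ASCII bytes + 2 bytes for à.
$byteLength = strlen($in);

// The final byte on its own is 0xa0, the ISO-8859-1 non-breaking space.
$lastByte = ord(substr($in, -1));

// Dropping that byte leaves a dangling 0xc3 lead byte, which is not valid UTF-8.
$truncated = substr($in, 0, -1);

// An empty pattern with /u only matches when the subject is well-formed UTF-8.
$originalIsUtf8  = (preg_match('//u', $in) === 1);
$truncatedIsUtf8 = (preg_match('//u', $truncated) === 1);

printf("bytes: %d, last byte: 0x%02x\n", $byteLength, $lastByte);
printf("original valid UTF-8: %s\n", $originalIsUtf8 ? 'yes' : 'no');
printf("truncated valid UTF-8: %s\n", $truncatedIsUtf8 ? 'yes' : 'no');
```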

$var = preg_replace('/\s+$/u', '', $in);

should enable UTF-8 mode for matching, and it should work on all systems.
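If you want to wrap this up, a minimal sketch might look like the following. The helper name rtrim_utf8 is made up for illustration, and the fallback to the original string on failure is an assumption about what you would want, not something preg_replace does for you.

```php
<?php
// Hypothetical helper: strip trailing whitespace from a UTF-8 string.
function rtrim_utf8($s)
{
    // /u makes PCRE match whole code points rather than individual bytes.
    $out = preg_replace('/\s+$/u', '', $s);
    // preg_replace() returns null on failure (e.g. malformed UTF-8 input);
    // fall back to the original string in that case.
    return $out === null ? $s : $out;
}

$plain  = rtrim_utf8("Citt\xc3\xa0");       // no trailing whitespace
$padded = rtrim_utf8("Citt\xc3\xa0 \t\n");  // trailing space, tab, newline

echo bin2hex($plain) . "\n";   // 43697474c3a0
echo bin2hex($padded) . "\n";  // 43697474c3a0
```

In both cases the c3a0 pair survives intact, because in UTF-8 mode the a0 byte is never examined on its own.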

Upvotes: 1
