Tauri28

Reputation: 908

PHP regular expressions character encoding issue

My regular expression won't match accented characters, so it finds no matches when I search for words containing the characters ü, õ, ö or ä.

$data is HTML with the tags removed by strip_tags. It contains words with ü, õ, ö and ä characters and is loaded via cURL from a website whose character encoding is UTF-8 (according to the returned headers):

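$ch = curl_init('my_website_url');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$data = strip_tags( curl_exec($ch) );
$match = preg_match( '/ü/' , $data , $matches );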

I have tried the following (also with 'ISO-8859-1'):

mb_internal_encoding("UTF-8");
mb_regex_encoding('UTF-8');

or:

$data = utf8_decode($data);

No success yet.

Upvotes: 1

Views: 1187

Answers (2)

Marcin Orlowski

Reputation: 75629

You should tell PCRE that you are using UTF-8, which is done by adding the u modifier: '/ü/u'. But if possible, do not put these characters directly into source code. If you (or your editor) change the encoding of the file, your code will stop working, and tracking this down would be quite a PITA. Instead of using '/ü/' directly, I'd suggest replacing the character in question with its Unicode code point: '/\x{fc}/u'. U+00FC is your letter (0xC3 0xBC is its UTF-8 byte sequence, which is not what \x{...} expects in UTF-8 mode).
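A minimal sketch of the u modifier combined with code point escapes. The sample string is a made-up stand-in for the stripped page text, written with byte escapes so the example does not depend on how this file is saved. Note that in UTF-8 mode \x{...} takes a Unicode code point, so ü is \x{fc} (U+00FC); c3 bc are the bytes of its UTF-8 encoding.

```php
<?php
// Hypothetical sample data: "Müller käis tööl" as UTF-8 bytes.
$data = "M\xC3\xBCller k\xC3\xA4is t\xC3\xB6\xC3\xB6l";

// With the u modifier, pattern and subject are treated as UTF-8,
// and \x{fc} means the code point U+00FC (ü).
$found = preg_match('/\x{fc}/u', $data, $matches);

// The same escapes work inside a character class: ü, õ, ö, ä.
$count = preg_match_all('/[\x{fc}\x{f5}\x{f6}\x{e4}]/u', $data, $all);

// bin2hex shows the matched bytes, i.e. the UTF-8 encoding of ü.
var_dump($found, bin2hex($matches[0]), $count);
```

Because the pattern contains only ASCII, it keeps working regardless of the encoding the source file is saved in.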

Upvotes: 0

Pekka

Reputation: 449435

Make sure your PHP source file is UTF-8 encoded as well.

If it is, for example, ISO-8859-1, the ü in your preg_match pattern will be a different byte sequence from the ü characters in your UTF-8 data.
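To illustrate the mismatch, here is a rough sketch that spells out the bytes a literal ü would turn into under each file encoding. The variable names and the sample word are made up for the example; the byte escapes simulate what PHP would actually see in the source file.

```php
<?php
// Hypothetical UTF-8 data: "küla", where ü is the two bytes C3 BC.
$utf8Data = "k\xC3\xBCla";

// What '/ü/' contains if the source file is saved as ISO-8859-1:
// ü is the single byte FC.
$latin1Pattern = "/\xFC/";

// What '/ü/' contains if the source file is saved as UTF-8.
$utf8Pattern = "/\xC3\xBC/";

// Without the u modifier, PCRE compares raw bytes.
$latin1Hit = preg_match($latin1Pattern, $utf8Data); // byte FC never occurs in the data
$utf8Hit   = preg_match($utf8Pattern, $utf8Data);   // bytes C3 BC do occur

var_dump($latin1Hit, $utf8Hit);
```

The pattern and the data only agree when both use the same encoding, which is why saving the file as UTF-8 (or avoiding literal non-ASCII characters in the pattern) matters.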

Upvotes: 1
