wheresrhys

Reputation: 23500

Strange behaviour when encoding cURL response as UTF-8

I'm making a cURL request to a third-party website which returns a text file. I need to do a few string replacements on it to replace certain characters with their HTML entity equivalents, e.g. I need to replace í with &iacute;.

Using str_replace/preg_replace_callback on the response directly didn't produce any matches (whether searching for í directly or using its hex code \x00\xED), so I called utf8_encode() before carrying out the replacement. But utf8_encode replaces all the í characters with Ã.

Why is this happening, and what's the correct approach to carrying out UTF-8 replacements on an arbitrary piece of text in PHP?

*edit - some further research reveals

utf8_decode("í") == í;
utf8_encode("í") == í;
utf8_encode("\xc3\xad") ==  í;

Upvotes: 1

Views: 1667

Answers (2)

goat

Reputation: 31813

You're probably specifying the characters/strings you want replaced via string literals in the PHP source code? If so, the value of each string literal depends on the encoding you save your PHP file in. So while you see the character í, the literal's bytes might be an ISO-8859-1 í, or a Windows cp1252 í, or a UTF-8 í, or maybe even a UTF-32 í. ISO-8859-1 and cp1252 happen to agree on í (the single byte 0xED), but UTF-8 stores it as two bytes (0xC3 0xAD), so literals saved in different encodings won't match in a PHP string comparison.

My point is, you need to specify the character in whatever encoding your incoming text is in.

Here's an example without using literals:

$iso8859_1 = chr(237);          // í in ISO-8859-1 is the single byte 0xED
$utf8 = utf8_encode(chr(237));  // "\xC3\xAD", the UTF-8 encoding of í

Be warned: if you decide to change the file encoding to UTF-8, text editors may or may not convert the existing characters. I've seen editors do really bizarre things when changing the encoding. Start with a fresh file.

Also, just because the other server claims it's UTF-8 doesn't mean it really is.
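One way to act on that last point is to check the bytes yourself before replacing anything. The following is only a sketch: to_utf8() is a hypothetical helper (not from the question), the ISO-8859-1 fallback is a guess about what the remote server might send instead, and it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: normalise an incoming string to UTF-8.
// The fallback encoding is an assumption; adjust it to the real source.
function to_utf8(string $raw, string $fallback = 'ISO-8859-1'): string
{
    // Bytes that already form well-formed UTF-8 are left untouched,
    // which avoids double-encoding.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // Otherwise assume the fallback encoding and convert.
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}

// Sample inputs written as explicit bytes, so the encoding this
// file is saved in does not matter.
$latin1 = "caf\xE9 r\xEDo";          // "café río" as ISO-8859-1
$utf8   = "caf\xC3\xA9 r\xC3\xADo";  // "café río" as UTF-8

$a = to_utf8($latin1);
$b = to_utf8($utf8);

// Both strings are now UTF-8, so a single byte-level search for
// the UTF-8 form of í (C3 AD) works on either.
$out = str_replace("\xC3\xAD", '&iacute;', $a);
```

Validity is only a heuristic: a short ISO-8859-1 string can occasionally also be well-formed UTF-8, so if the response's Content-Type header declares a charset, that is worth consulting too.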

Upvotes: 1

Ansari

Reputation: 8218

utf8_encode is definitely not the way to go here (you're double-encoding if you do that).
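To make the double-encoding visible, here is a small sketch that dumps the bytes before and after. It uses mb_convert_encoding() from ISO-8859-1 to UTF-8, which performs the same conversion as utf8_encode(); the variable names are illustrative and the mbstring extension is assumed.

```php
<?php
// í as UTF-8 is the two bytes C3 AD.
$utf8 = "\xC3\xAD";

// Treat each of those bytes as an ISO-8859-1 character and encode
// it as UTF-8 again, as utf8_encode() would.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";    // c3ad
echo bin2hex($double), "\n";  // c383c2ad
```

The result is U+00C3 (Ã) followed by U+00AD (a soft hyphen, normally invisible), which would explain why the output appears to contain just Ã where í used to be.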

Re. searching for the character directly or using its hex code, did you make sure to add the u modifier at the end of the regex? e.g. /\x00\xED/u?
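As a rough illustration of what the u modifier changes, assuming the response really is UTF-8: with /u, PCRE reads \x{ED} as the code point U+00ED and matches its UTF-8 byte sequence, whereas without /u the pattern \xED is the single byte 0xED. Note that \x00\xED would look for a NUL character followed by í, so the sketch below uses \x{ED} on its own. The sample string is made up for the example.

```php
<?php
// "río" written as explicit UTF-8 bytes (sample input only).
$text = "r\xC3\xADo";

// With /u, \x{ED} is the code point U+00ED, which matches the
// UTF-8 sequence C3 AD in the subject.
$withU = preg_replace('/\x{ED}/u', '&iacute;', $text);

// Without /u, \xED is the raw byte 0xED, which does not occur
// anywhere in the UTF-8 subject, so nothing is replaced.
$withoutU = preg_replace('/\xED/', '&iacute;', $text);

echo $withU, "\n";  // r&iacute;o
```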

Upvotes: 1
