Marcus

Reputation: 9439

PHP URLDecode / UTF8_Encode Character Set Issues with special characters

I'm passing a pound symbol (£) to a PHP page. ASP has URL-encoded it as %C2%A3.

The problem:

urldecode("%C2%A3") // £
ord(urldecode("%C2%A3")) // get the character number - 194
ord("£") // 163 - something's gone wrong, they should match

This means that when I do utf8_encode(urldecode("%C2%A3")) I get Â£

However, doing utf8_encode("£") gives me £ as expected

How can I solve this?

Upvotes: 4

Views: 12199

Answers (4)

Dexter

Reputation: 3122

The first comment on php.net for urlencode() explains why this is and suggests this code for correcting it:

<?php
function to_utf8( $string ) {
    // From http://w3.org/International/questions/qa-forms-utf-8.html
    if ( preg_match('%^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$%xs', $string) ) {
        return $string;
    } else {
        return iconv( 'CP1252', 'UTF-8', $string);
    }
}
?> 

Also, you should decide whether the final HTML you send to the browser will be in UTF-8 or some other encoding; otherwise you will keep getting Â£ characters in your output.
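For comparison, the same "leave valid UTF-8 alone, otherwise convert from CP1252" idea can be sketched with mbstring instead of the regex. This is only an illustration: `ensure_utf8` is a hypothetical name, the mbstring extension is assumed to be loaded, and `mb_check_encoding()` is not an exact substitute for the W3C pattern (for example, it accepts control characters that the pattern rejects).

```php
<?php
// Hypothetical helper, not taken from the php.net comment.
// Assumes the mbstring extension is available.
function ensure_utf8(string $string): string {
    // Already well-formed UTF-8: return it untouched to avoid double encoding.
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string;
    }
    // Otherwise assume the bytes are Windows-1252 and convert them.
    return mb_convert_encoding($string, 'UTF-8', 'Windows-1252');
}

echo bin2hex(ensure_utf8(urldecode("%C2%A3"))), "\n"; // c2a3 (unchanged)
echo bin2hex(ensure_utf8("\xA3")), "\n";              // c2a3 (converted)
```

Either way, the input is only converted when it is not already valid UTF-8, which is what prevents the Â£ result.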

Upvotes: -1

Arkh

Reputation: 8459

Some information about urldecode and UTF-8 can be found in the first comment of the urldecode documentation. It seems to be a known problem.

Upvotes: 2

Kaivosukeltaja

Reputation: 15735

I don't think ord() is multibyte compatible. It's probably returning only the value of the first byte in the string, which on its own is Â. Try calling utf8_decode() on the string before passing it to ord() and see if that helps.

ord(utf8_decode(urldecode("%C2%A3"))); // This returns 163
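As a rough alternative sketch, assuming PHP 7.2 or later with the mbstring extension (utf8_decode() is deprecated as of PHP 8.2), mb_ord() reads the whole multibyte character instead of just its first byte:

```php
<?php
// Assumes PHP 7.2+ with the mbstring extension loaded.
$decoded = urldecode("%C2%A3");          // the two bytes 0xC2 0xA3, i.e. "£" in UTF-8

var_dump(ord($decoded));                 // int(194): value of the first byte only
var_dump(mb_ord($decoded, 'UTF-8'));     // int(163): Unicode code point U+00A3
var_dump(strlen($decoded));              // int(2): length in bytes
var_dump(mb_strlen($decoded, 'UTF-8'));  // int(1): length in characters
```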

Upvotes: 3

Wh1T3h4Ck5

Reputation: 8509

If you try

var_dump(urldecode("%C2%A3"));

you'll see

string(2) "£"

because this is a 2-byte character and ord() returns the value of the first byte only (194 = Â).
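To make the bytes visible, here is a small sketch using bin2hex(). It assumes the mbstring extension; the mb_convert_encoding() call stands in for utf8_encode(), which performs the same ISO-8859-1 to UTF-8 conversion.

```php
<?php
// Assumes the mbstring extension is available.
$decoded = urldecode("%C2%A3");
echo bin2hex($decoded), "\n";   // c2a3: already the UTF-8 encoding of "£"

// Treating those bytes as ISO-8859-1 and converting again (what utf8_encode() does)
// encodes each byte separately: 0xC2 becomes c382 ("Â"), 0xA3 becomes c2a3 ("£").
$double = mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1');
echo bin2hex($double), "\n";    // c382c2a3: displays as "Â£"
```

So the decoded value is already UTF-8, and encoding it a second time is what produces the extra Â.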

Upvotes: 4
