Reputation: 1

Php5 - Unicode string length

I need to get the correct length of Unicode text received via HTTP POST/GET.

"हेल्लो स्टैक ओवरफ्लो"

When I set the browser's character encoding to Unicode, mb_strlen($text) gives me the correct length of the Unicode string, which is 20.

But when I submit the form with the browser's encoding set to 'ISO-8859-1', it behaves oddly. mb_strlen($text) gives me 128, which looks like a byte length and is wrong. Also,

mb_detect_encoding($text, "auto") returns ASCII, while mb_detect_encoding($text, "UTF-8") returns UTF-8.

I need the correct length of the Unicode text, irrespective of the browser charset.

Can anyone help me solve this problem?

Regards, Sandip

Upvotes: 0

Answers (2)

bobince

Reputation: 536567

I need the correct length of the Unicode text, irrespective of the browser charset.

You can't know the length if you don't know the encoding. The same sequence of bytes may represent different valid strings in different encodings. mb_detect_encoding gives you nothing more than an unreliable guess.
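As a small illustration of that point, here is a sketch that measures the same three bytes under two different encodings. The byte string is an example chosen for this sketch (it happens to be the UTF-8 form of a single Devanagari letter); it is not taken from the question's form data.

```php
<?php
// The UTF-8 encoding of the single character U+0939 is three bytes.
$bytes = "\xE0\xA4\xB9";

$asUtf8   = mb_strlen($bytes, 'UTF-8');      // counted as UTF-8 code points
$asLatin1 = mb_strlen($bytes, 'ISO-8859-1'); // every byte counts as one character
$rawBytes = strlen($bytes);                  // plain byte count

printf("UTF-8: %d, ISO-8859-1: %d, bytes: %d\n", $asUtf8, $asLatin1, $rawBytes);
```

Both answers are "correct" for the encoding named; only knowing which encoding was actually used tells you which one you want.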

There is a sneaky way many modern browsers support for them to tell you what encoding they have used, which is to include this hack (originating in IE) in the form:

<input type="hidden" name="_charset_"/>

You'll then get an encoding name submitted in that field, which you can theoretically use to mb_convert_encoding a string you have received to UTF-8 for further handling. You definitely want to keep all your strings in a single encoding in your scripts, only converting to other encodings at the input/output ends where necessary; it's very unpleasant trying to keep track of byte strings in arbitrary encodings.
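A rough sketch of how that conversion might look on the server. The helper name to_utf8 and the field name "text" are made up for this example, and the lookup only accepts names that mb_list_encodings() reports, so a browser-supplied alias that mbstring spells differently would be rejected rather than guessed at.

```php
<?php
/**
 * Hypothetical helper: convert a submitted value to UTF-8 using the
 * charset name the browser reported in the hidden _charset_ field.
 * Returns null when the reported name is not one mbstring lists.
 */
function to_utf8(string $value, string $reportedCharset): ?string
{
    $name = trim($reportedCharset);

    // Never pass an unchecked, user-supplied name to mb_convert_encoding():
    // unknown names raise a warning or a ValueError depending on PHP version.
    $known = array_map('strtolower', mb_list_encodings());
    if (!in_array(strtolower($name), $known, true)) {
        return null;
    }

    return mb_convert_encoding($value, 'UTF-8', $name);
}

// Usage in a form handler; "text" is an assumed field name.
$charset = $_POST['_charset_'] ?? 'UTF-8';
$text    = to_utf8($_POST['text'] ?? '', $charset);
```

After this point the rest of the script can treat $text as UTF-8 (or as null, meaning "reject the submission").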

However, you can't convert an ISO-8859-1 string containing हेल्लो... to UTF-8, because ISO-8859-1 can't contain those characters. Your data have already been corrupted as described by deceze: when you submit form data in an encoding that can't contain the characters, the browser escapes them using HTML &#...; character references. This is a lossy conversion that you can't accurately recover, because you can't tell the difference between these escapes and actual ampersand-hash sequences the user originally typed. Never rely on this long-standing but quirky and undesirable behaviour.
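The 128 in the question is consistent with this escaping. The sketch below simulates it on the server with mb_encode_numericentity (the browser does the equivalent on its own side; this is only an approximation of that behaviour) and then shows why decoding is not a safe fix. It assumes the source file is saved as UTF-8, and the "typed" sentence is an invented example.

```php
<?php
// Assumes this source file is saved as UTF-8.
$original = "हेल्लो स्टैक ओवरफ्लो";

// Simulate the browser: replace every non-ASCII code point with a
// decimal character reference such as &#2361;
$map     = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$mangled = mb_encode_numericentity($original, $map, 'UTF-8');

echo mb_strlen($original, 'UTF-8'), "\n"; // 20 code points
echo strlen($mangled), "\n";              // 128: 18 references of 7 bytes, plus 2 spaces

// Decoding brings this particular text back ...
$decoded = html_entity_decode($mangled, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// ... but it would equally rewrite a reference the user typed on purpose.
$typed        = 'The snowman is written &#9731; in HTML';
$typedDecoded = html_entity_decode($typed, ENT_QUOTES | ENT_HTML5, 'UTF-8');
```

$typedDecoded no longer contains the literal text the user entered, which is the ambiguity that makes the escaping unrecoverable in general.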

In general it's really much better just to ensure that the form submission always comes in using a known encoding that covers all the characters you are likely to want. That way you don't have to worry about conversion, or whether there has been any character-reference-mangling. The only sensible encoding to pick for this purpose is UTF-8. (UTF-16 has some browser problems apart from being generally less efficient.)

Browsers submit forms using the same encoding that they used to display the page, so use the Content-Type: text/html;charset=utf-8 header and/or <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/> equivalent to specify the page encoding, rather than letting the browser guess. It will then use that encoding for the form submission too.
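On the PHP side this might look roughly like the following. The helper name utf8_length and the field name "text" are assumptions made for the sketch; the header call has to run before any output is sent.

```php
<?php
// Declare the page encoding in the HTTP header, before any output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

/**
 * Hypothetical helper: length in code points, or null when the bytes
 * are not valid UTF-8 and should not be counted at all.
 */
function utf8_length(string $value): ?int
{
    return mb_check_encoding($value, 'UTF-8')
        ? mb_strlen($value, 'UTF-8')
        : null;
}

// "text" is an assumed field name.
$length = utf8_length($_POST['text'] ?? '');
```

Passing the encoding to mb_strlen explicitly avoids depending on whatever mbstring's internal encoding happens to be configured as.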

The only remaining wrinkle is that if the user deliberately overrides the encoding of the page with the form in, you'll get the wrong data submitted. This is very unlikely to happen unless your page is already broken, so usually it's not worth bothering with.

If you want to cover that possibility you can set the attribute accept-charset on the form. However! This doesn't work right in IE, which only treats accept-charset as a fallback suggestion for when it has form data that doesn't fit within the page's natural encoding. If you want to ensure you get UTF-8 even in the face of the user changing the encoding to something else, you'd have to include some data in the form that can't be encoded in any of the other encodings the user is able to pick. The traditional way of doing that is:

<form accept-charset="utf-8">
    <input type="hidden" name="unicodesnowman" value="&#x2603;"/>
    ...
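One way the server could use that hidden field, sketched under the assumption that the form is handled by a PHP script receiving it in $_POST. The function name submitted_as_utf8 is invented for this example.

```php
<?php
/**
 * Hypothetical check: did the hidden "unicodesnowman" field arrive as
 * the UTF-8 bytes for U+2603? If the browser used another encoding, the
 * value typically shows up as a character reference or as other bytes.
 */
function submitted_as_utf8(array $post): bool
{
    return ($post['unicodesnowman'] ?? '') === "\xE2\x98\x83";
}

if (!submitted_as_utf8($_POST)) {
    // Not a UTF-8 submission (or the field is missing): reject or
    // re-prompt here rather than trying to guess the encoding.
}
```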

Upvotes: 1

Polynomial

Reputation: 28316

ISO-8859-1, aka the Western European character set, covers the extended Roman alphabet, which does not include the characters you specified above (is that Hindi? I'm not so well-versed in such languages). The mb_detect_encoding call will not detect your encoding, because the characters were mangled into ISO-8859-1, which doesn't support them.

You should specify an encoding that supports the characters you need to display. UTF-8 would probably be your best bet. You can explicitly set the encoding in your HTTP headers using the charset parameter of the Content-Type header. You can also repeat this in a meta tag in your HTML for maximum support.

Upvotes: 2
