Maarten
Maarten

Reputation: 4671

How do browsers/PHP handle characters outside the set characterset?

I'm looking into how characters are handled that are outside of the set characterset for a page.

In this case the page is set to iso-8859-1, and the previous programmer decided to escape input using htmlentities($string,ENT_COMPAT). This is then stored into Latin1 tables in Mysql.

As the table is set to the same character set as the page, I am wondering if that htmlentities step is needed. I did some experiments on http://floris.workingweb.nl/experiments/characters.php and it seems that for stuff inside Latin1 some characters are escaped, but for example with a Czech name they are not.

Is this because those characters are outside of Latin1? If so, then the htmlentities can be removed, as it doesn't help for stuff outside of Latin1 anyway, and for within Latin1 it is not needed as far as I can see now...

Upvotes: 1

Views: 207

Answers (4)

AlexV
AlexV

Reputation: 23098

Take note that htmlentities / htmlspecialchars have a 3rd parameter (since PHP 4.1.0) for the charset. ISO-8859-1 is the default so if you apply htmlentities without a 3rd parameter to a UTF-8 string for example, the output will be corrupted.

You can detect & convert the input string with mb_detect_encoding and mb_convert_encoding to make sure the input string match the desired charset.

Upvotes: 0

user187291
user187291

Reputation: 53940

htmlentities only translates characters it knows about (get_html_translation_table(HTML_ENTITIES) returns the whole list), and leaves the rest as is. So you're right, using it for non-latin data makes no sense. Moreover, both html-encoding of database entries and using latin1 are bad ideas either, and I'd suggest to get rid of them both.

A word of warning: after removing htmlentities(), remember that you still need to escape quotes for the data you're going to insert in DB (mysql_escape_string or similar).

Upvotes: 1

Your Common Sense
Your Common Sense

Reputation: 157860

Yes
though not because Czech characters are outside of Latin1 but because they share the same places in the table. So, database take it as corresponding latin1 characters.

using htmlentities is always bad. the only proper solution to store different languages is to use UTF-8 charset.

Upvotes: 0

wimvds
wimvds

Reputation: 12850

He could have used it as a basic safety precaution, ie. to prevent users from inserting HTML/Javascript into the input (because < and > will be escaped as well).

btw If you want to support Eastern and Western European languages I would suggest using UTF-8 as the default character encoding.

Upvotes: 0

Related Questions