user1700243

Reputation: 13

Handling multi language website

I have a multi-language website that communicates with a database, which contains language-specific translations.

For example, a table gender has 10 columns, and each column holds one language.

+---------+-----------+-----+
| English | French    | etc |
+---------+-----------+-----+
| Male    | Masculine | ... |
+---------+-----------+-----+

Some languages (like Chinese, Greek, Turkish, Spanish, Russian, etc.) have characters outside of latin1, and when I read the data from the database on my site, they come out as ? and garbled symbols (mojibake).

So, how do I fix this?

I know I need to use a certain collation on the DB and add the specific meta charset tag, but it's still not working.

 cp1256 | Windows Arabic          | cp1256_general_ci (it's not giving me the correct arabic solution.)
 gbk    | GBK Simplified Chinese  | gbk_chinese_ci    (it's not giving me the correct chinese solution.)

Upvotes: 1

Views: 1172

Answers (4)

m4t1t0

Reputation: 5721

You should use separate tables for translations, not columns. That way you can specify the charset for each table.

At this moment you have:

+---------+-----------+-----+
| English | French    | etc |
+---------+-----------+-----+
| Male    | Masculine | ... |
+---------+-----------+-----+

You should have:

gender_en
+-----------+--------------+
| id_gender |       value  |
+-----------+--------------+
|         1 |         Male |
|         2 |       Female |
+-----------+--------------+

gender_es
+-----------+--------------+
| id_gender |       value  |
+-----------+--------------+
|         1 |       Hombre |
|         2 |        Mujer |
+-----------+--------------+

gender_fr
.....

And so on
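A minimal sketch of how a lookup against this layout might work. It is written in Python with an in-memory SQLite database purely for illustration (the setup in the question is PHP with MySQL); the helper names `build_demo_db` and `translate` and the language whitelist are assumptions, not part of the answer.

```python
import sqlite3

# Languages that have their own gender_<lang> table (illustrative whitelist).
LANGUAGES = {"en", "es"}


def build_demo_db():
    """Create an in-memory database with one gender table per language."""
    conn = sqlite3.connect(":memory:")
    rows = {
        "en": [(1, "Male"), (2, "Female")],
        "es": [(1, "Hombre"), (2, "Mujer")],
    }
    for lang, values in rows.items():
        conn.execute(
            f"CREATE TABLE gender_{lang} (id_gender INTEGER PRIMARY KEY, value TEXT)"
        )
        conn.executemany(f"INSERT INTO gender_{lang} VALUES (?, ?)", values)
    return conn


def translate(conn, lang, id_gender):
    """Return the label for id_gender in the given language, or None."""
    # Table names cannot be bound as query parameters, so check a whitelist.
    if lang not in LANGUAGES:
        raise ValueError(f"unsupported language: {lang}")
    row = conn.execute(
        f"SELECT value FROM gender_{lang} WHERE id_gender = ?", (id_gender,)
    ).fetchone()
    return row[0] if row else None
```

Because the language ends up in the table name, it has to be validated before it is interpolated into the query; only the id can be passed as a bound parameter.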

Upvotes: 1

martinstoeckli

Reputation: 24071

The easiest way will be to use UTF-8 for the whole website. UTF-8 can represent every character covered by the other encodings. If you are using MySQL, it is important that you tell the connection object to use UTF-8 before you make a query. I wrote a short article on how you can use UTF-8 in PHP and MySQL.
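To illustrate why a single-byte charset produces the "?" symptom while UTF-8 does not, here is a small sketch. It uses Python rather than PHP, and the function name `encode_report` is hypothetical; it only shows the encoding behaviour, not a database connection.

```python
def encode_report(text):
    """Round-trip text through UTF-8 and through latin-1.

    Characters that latin-1 cannot hold are replaced with "?",
    which mirrors the question marks seen on the page.
    """
    utf8_bytes = text.encode("utf-8")
    latin1_bytes = text.encode("latin-1", errors="replace")
    return utf8_bytes.decode("utf-8"), latin1_bytes.decode("latin-1")


for sample in ["niño", "Мужчина", "男"]:
    via_utf8, via_latin1 = encode_report(sample)
    print(sample, "->", via_utf8, "|", via_latin1)
```

Spanish text such as "niño" happens to fit in latin-1, so it survives both paths; Russian and Chinese text only survives the UTF-8 path.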

The collation is not the same as the charset, it only defines how two values are compared (e.g. for sorting).
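A rough analogy in Python, assuming nothing about MySQL's actual collation rules: the stored strings stay the same, and only the comparison rule used for sorting changes.

```python
words = ["zebra", "Apple", "apple", "Zoo"]

# Code-point comparison: every uppercase ASCII letter sorts before lowercase.
binary_order = sorted(words)

# Case-insensitive comparison, loosely comparable to a *_ci collation.
ci_order = sorted(words, key=str.casefold)

print(binary_order)
print(ci_order)
```

Both lists contain exactly the same values; switching the comparison rule never changes which characters can be stored or displayed.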

Upvotes: 0

SDC

Reputation: 14222

There are a whole load of areas of your system that need to be considered when looking at multi-lingual systems.

You need to ensure that you are using a suitable character encoding throughout your system. In most cases, the best choice of character encoding is UTF-8. (UTF-8 and UTF-16 both cover all of Unicode; UTF-16 is occasionally preferred, for example because it is more compact for text that is mostly East Asian, but these cases are few and far between, and PHP will struggle with UTF-16 anyway, so in general stick with UTF-8 for everything and you'll be fine.)

You need to make sure you're using the same character encoding in the following places:

  • Your database tables.
  • Your web server.
  • Your PHP source code.

The database is easy to deal with: just make sure all tables are created with UTF-8 encoding for their charset. Job done.

Collation is less relevant: it specifies the sort order. This does matter of course, but it has no bearing on the garbled text display you're seeing. (It's worth saying that some characters are sorted differently in different languages, so it's virtually impossible to pick a collation that will suit everyone if you need to support multiple languages in a single table, but I wouldn't get too worried about this for now.)

The web server is relatively simple too, as long as you're comfortable with Apache config (or whatever server software you're using). You need to ensure that all pages output to the browser are sent using UTF-8 encoding.
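What matters to the browser is the `Content-Type` header and the bytes that follow it. The sketch below is a Python illustration with a hypothetical helper `build_html_response`; in Apache the same effect is usually achieved through configuration (for example the `AddDefaultCharset` directive) rather than code.

```python
def build_html_response(html):
    """Return (headers, body) for an HTML page sent as UTF-8."""
    body = html.encode("utf-8")
    headers = {
        # The charset declared here must match how the body was encoded.
        "Content-Type": "text/html; charset=utf-8",
        # Content-Length counts bytes, not characters.
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = build_html_response("<p>男</p>")
print(headers["Content-Type"], len(body))
```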

Finally, your PHP source code...

Firstly, you should make sure you're editing the actual PHP code files in UTF-8 mode. Otherwise, you may have trouble if you have any extended characters written in your code.

Secondly, be aware that a number of PHP's standard string handling functions are "not multi-byte aware". This means that they don't work correctly with extended character sets. For example, strlen() will return the number of bytes the string takes up in memory. This will be incorrect if your string includes characters that take up more than one byte. Fortunately, PHP also supplies a set of multi-byte functions to resolve this. So for example, instead of using strlen(), use mb_strlen(). The PHP manual gives more detail about the exact functions available and when to use them.
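The byte count versus character count difference can be shown in a few lines. This is a Python analogy (the helper name `byte_and_char_length` is made up for the example): the first number corresponds to what PHP's strlen() reports for a UTF-8 string, the second to what mb_strlen() reports when the encoding is UTF-8.

```python
def byte_and_char_length(text):
    """Return (bytes when encoded as UTF-8, number of characters)."""
    encoded = text.encode("utf-8")
    return len(encoded), len(text)


for sample in ["Male", "Мужчина", "男"]:
    print(sample, byte_and_char_length(sample))
```

Plain ASCII text gives the same value for both counts, which is why this kind of bug often goes unnoticed until non-English data arrives.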

Also, make sure that you handle any incoming posted data with the correct character set as well.
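As a sketch of what handling posted data with the correct character set means, the following Python example decodes a URL-encoded form body explicitly as UTF-8. The function name `decode_form_body` is hypothetical, and it assumes the form was submitted from a page served as UTF-8.

```python
from urllib.parse import parse_qs, urlencode


def decode_form_body(raw_body):
    """Decode an application/x-www-form-urlencoded body sent as UTF-8."""
    # The raw body is ASCII with percent-escapes; the escapes carry UTF-8 bytes.
    # errors="strict" is meant to surface bad input instead of replacing it.
    return parse_qs(raw_body.decode("ascii"), encoding="utf-8", errors="strict")


# Simulate what a browser would send for a field containing a Chinese character.
posted = urlencode({"gender": "男"}).encode("ascii")
print(posted, decode_form_body(posted))
```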

Hopefully that will help you. The key here is to ensure that your system uses a consistent character set throughout all its layers. Problems with weird-looking encoding errors tend to happen when one layer in your system is using a different character set to the others. Make sure they're all the same (and preferably UTF-8), and that should resolve your garbled character problems.
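A layer mismatch is easy to reproduce. In this Python sketch (the helper `misread` is hypothetical), text is written by one layer as UTF-8 and read by another as latin-1, which yields the typical garbled output.

```python
def misread(text, written_as="utf-8", read_as="latin-1"):
    """Encode text with one charset and decode the bytes with another."""
    return text.encode(written_as).decode(read_as)


print(misread("é"))                   # layers disagree
print(misread("é", read_as="utf-8"))  # layers agree
```

When both layers agree, the text comes back unchanged; when they disagree, one accented character turns into two unrelated ones.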

Upvotes: 1

Explosion Pills

Reputation: 191729

Collation is used for comparing and sorting, while the charset determines how text is stored. Apparently you're using the latin1 charset, which is interesting. Many would suggest going with a UTF-8 charset, in which case you will have to convert all of your existing data to that charset. Personally, I would store binary data (binary vs. char, varbinary vs. varchar, blob vs. text). This is only a problem if you need accurate sorting (collation), as binary sorting is different.
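Converting existing data comes down to decoding the stored bytes with the old charset and encoding them again with the new one. The sketch below shows the principle in Python with a hypothetical helper `latin1_to_utf8`. Note the assumption: Python's "latin-1" is ISO-8859-1, while MySQL's latin1 is closer to Windows-1252, so this illustrates the idea rather than an exact migration step.

```python
def latin1_to_utf8(raw):
    """Re-encode bytes that were stored as latin-1 into UTF-8 bytes."""
    return raw.decode("latin-1").encode("utf-8")


stored = b"ni\xf1o"  # "niño" as latin-1: one byte per character
print(latin1_to_utf8(stored))
```

The conversion only works if the old charset is known and the stored bytes really are in that charset; data that was already garbled on the way in stays garbled.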

Upvotes: 1
