clamp
clamp

Reputation: 34006

Revert badly encoded umlauts

For some reason my special characters got encoded as the following string in a mysql database:

Ã?

Which shows up as:

Ã?

But actually should show up as:

Ö
  1. What went wrong here? I use UTF-8 everywhere.

  2. How can I fix this without recreating all content?

Upvotes: 0

Views: 232

Answers (2)

Rick James
Rick James

Reputation: 142218

For PDO, use something like

$db = new PDO('dblib:host=host;dbname=db;charset=UTF-8', $user, $pwd);

Ã? is two or three things going wrong, not just one! C396 is the utf8 hex for Ö or the latin1 hex for the two characters Ö. It requires something else to go wrong to get ? or the black diamond.

Let's see what is in the table; do

SELECT col, HEX(col) FROM tbl WHERE ...

(If you have already done the previously suggested replace(), then the table may be in an even worse mess. Or it might be fixed.)

Upvotes: 0

trincot
trincot

Reputation: 350137

I executed the following in PHP:

<?php
echo str_replace("&", "&amp;", htmlentities("Ö", 0, "ISO-8859-1")) , '<br />';    
echo str_replace("&", "&amp;", htmlentities("Ö", 0, "UTF-8")), "</br>";
?>

The str_replace is just there to reveal any HTML mnemonics, which would otherwise be translated by the browser to the original character, which I don't want to happen.

You will get this as output:

&Atilde;�
&Ouml;

You'll recognise the first value as what you found in the database, and the second one is a bit like you wanted it to be. Add to this the fact that the default value for the third argument to htmlentities depends on your PHP version and is ISO-9959-1 in the case of version 5.3, the one you use. Also realise that HTML documents which do not specify a character encoding will by default post form data in ISO-8859-1 format. Combining all this might give a clue about the cause of your problem:

My guess is that the data is correctly posted as UTF-8 to the server, but then htmlentities interprets this as a non-UTF-8, single byte encoding, and so turns one, multi-byte character into two single byte characters.

Now to the measures to take that this does not continue to happen:

First make sure that your HTML form has the UTF-8 encoding, because this determines the default encoding that a form will use for sending its data to the server:

<head>
    <meta charset="UTF-8">
</head> 

Make sure this is not overruled by another encoding in the form tag's accept-charset attribute.

Then, skip the htmlentities call. You should not turn characters into their HTML mnemonic when storing them in the database. MySql supports UTF-8 characters, so just store them like that.

For the second question, you'll have to find all cases and bulk replace them as you find new instances. You could get get a little help by producing some SQL statements with a PHP script like the following:

<?php
    // list all your non-ASCII characters here. Do not use str_split.
    $chars = ["Ö","õ","Ũ","ũ"];
    foreach ($chars as $ch) {
        $bad = str_replace("&", "&amp;", htmlentities($ch, 0, "ISO-8859-1"));
        echo "update mytable set myfield = replace(myfield, '$bad', '$ch') 
                where instr(myfield, '$bad') > 0;<br />";
    }
?>

The output of this script will look like this:

update mytable set myfield = replace(myfield, '&Atilde;�', 'Ö') where instr(myfield, '&Atilde;�') > 0;
update mytable set myfield = replace(myfield, '&Atilde;&micro;', 'õ') where instr(myfield, '&Atilde;&micro;') > 0;
update mytable set myfield = replace(myfield, '&Aring;&uml;', 'Ũ') where instr(myfield, '&Aring;&uml;') > 0;
update mytable set myfield = replace(myfield, '&Aring;&copy;', 'ũ') where instr(myfield, '&Aring;&copy;') > 0;

Of course, you could decide to make a PHP script that will even do the updates itself.

Hopefully you can use this information to fix the issues.

Upvotes: 2

Related Questions