Reputation: 4375
I am in the process of fixing some bad UTF-8 encoding. I am currently using PHP 5 and MySQL.
In my database I have a few instances of bad encodings that print like: î
I need some sort of function that will help me map the instances of î, ÃÂ, ü and others like it to their proper accented UTF-8 characters.
Upvotes: 70
Views: 146626
Reputation: 1318
I recently had to work on a legacy project which had the MySQL table collation set to latin1_swedish_ci, and when I retrieved the data from PHP it showed up as encoded garbage: ¡à¥à¤¨à¥‡ रहर à¤. The text was supposed to show up as UTF-8.
Specifying the charset after the db connection in PHP fixed it for me:
mysqli_set_charset($conn,"latin1");
I'd like to know how setting the charset to latin1 fixed this, and how to clean up the database properly, from someone more knowledgeable about this.
Upvotes: 0
Reputation: 59
$bad_string = "Luis Pérez Casas, del Collettivo di avvocati “José Alvear Restrepoâ€, Colombia, un’organizzazione soggetta a costanti minacce";
$good_string = fix_broken_chars($bad_string);
echo $good_string;
function fix_broken_chars($garbled_utf8_string)
{
$conv_table = unserialize('a:5:{i:0;a:3:{s:8:"’";s:3:"’";s:8:"–";s:3:"–";s:8:"—";s:3:"—";}i:1;a:12:{s:7:"€";s:3:"€";s:7:"‚";s:3:"‚";s:7:"„";s:3:"„";s:7:"…";s:3:"…";s:7:"‡";s:3:"‡";s:7:"‰";s:3:"‰";s:7:"‹";s:3:"‹";s:7:"‘";s:3:"‘";s:7:"“";s:3:"“";s:7:"•";s:3:"•";s:7:"â„¢";s:3:"™";s:7:"›";s:3:"›";}i:2;a:22:{s:5:"À";s:2:"À";s:5:"Â";s:2:"Â";s:5:"Æ’";s:2:"ƒ";s:5:"Ä";s:2:"Ä";s:5:"Ã…";s:2:"Å";s:5:"â€";s:3:"”";s:5:"Æ";s:2:"Æ";s:5:"Ç";s:2:"Ç";s:5:"ˆ";s:2:"ˆ";s:5:"É";s:2:"É";s:5:"Ë";s:2:"Ë";s:5:"Å’";s:2:"Œ";s:5:"Ñ";s:2:"Ñ";s:5:"Ã’";s:2:"Ò";s:5:"Ó";s:2:"Ó";s:5:"Ô";s:2:"Ô";s:5:"Õ";s:2:"Õ";s:5:"Ö";s:2:"Ö";s:5:"×";s:2:"×";s:5:"Ù";s:2:"Ù";s:5:"Û";s:2:"Û";s:5:"Å“";s:2:"œ";}i:3;a:77:{s:4:"Ã";s:2:"Ã";s:4:"È";s:2:"È";s:4:"Ê";s:2:"Ê";s:4:"ÃŒ";s:2:"Ì";s:4:"Ž";s:2:"Ž";s:4:"ÃŽ";s:2:"Î";s:4:"Ëœ";s:2:"˜";s:4:"Ø";s:2:"Ø";s:4:"Å¡";s:2:"š";s:4:"Ú";s:2:"Ú";s:4:"Ãœ";s:2:"Ü";s:4:"ž";s:2:"ž";s:4:"Þ";s:2:"Þ";s:4:"Ÿ";s:2:"Ÿ";s:4:"ß";s:2:"ß";s:4:"¡";s:2:"¡";s:4:"á";s:2:"á";s:4:"¢";s:2:"¢";s:4:"â";s:2:"â";s:4:"£";s:2:"£";s:4:"ã";s:2:"ã";s:4:"¤";s:2:"¤";s:4:"ä";s:2:"ä";s:4:"Â¥";s:2:"¥";s:4:"Ã¥";s:2:"å";s:4:"¦";s:2:"¦";s:4:"æ";s:2:"æ";s:4:"§";s:2:"§";s:4:"ç";s:2:"ç";s:4:"¨";s:2:"¨";s:4:"è";s:2:"è";s:4:"©";s:2:"©";s:4:"é";s:2:"é";s:4:"ª";s:2:"ª";s:4:"ê";s:2:"ê";s:4:"«";s:2:"«";s:4:"ë";s:2:"ë";s:4:"¬";s:2:"¬";s:4:"ì";s:2:"ì";s:4:"Â";s:2:"";s:4:"Ã";s:2:"í";s:4:"®";s:2:"®";s:4:"î";s:2:"î";s:4:"¯";s:2:"¯";s:4:"ï";s:2:"ï";s:4:"°";s:2:"°";s:4:"ð";s:2:"ð";s:4:"±";s:2:"±";s:4:"ñ";s:2:"ñ";s:4:"²";s:2:"²";s:4:"ò";s:2:"ò";s:4:"³";s:2:"³";s:4:"ó";s:2:"ó";s:4:"´";s:2:"´";s:4:"ô";s:2:"ô";s:4:"µ";s:2:"µ";s:4:"õ";s:2:"õ";s:4:"¶";s:2:"¶";s:4:"ö";s:2:"ö";s:4:"·";s:2:"·";s:4:"÷";s:2:"÷";s:4:"¸";s:2:"¸";s:4:"ø";s:2:"ø";s:4:"¹";s:2:"¹";s:4:"ù";s:2:"ù";s:4:"º";s:2:"º";s:4:"ú";s:2:"ú";s:4:"»";s:2:"»";s:4:"û";s:2:"û";s:4:"¼";s:2:"¼";s:4:"ü";s:2:"ü";s:4:"½";s:2:"½";s:4:"ý";s:2:"ý";s:4:"¾";s:2:"¾";s:4:"þ";s:2:"þ";s:4:"¿";s:2:"¿";s:4:"ÿ";s:2:"ÿ";}i:4;a:1:{s:2:"Ã";s:2:"à";}}');
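    // Apply each replacement group in turn; the groups are ordered from the
    // longest garbled sequences to the shortest.
    foreach ($conv_table as $convert) {
        $garbled_utf8_string = str_replace(array_keys($convert), $convert, $garbled_utf8_string);
    }
    return $garbled_utf8_string;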
}
This implements the table at http://www.i18nqa.com/debug/utf8-debug.html
Upvotes: 4
Reputation: 869
I've had to try to 'fix' a number of broken UTF-8 situations in the past, and unfortunately it's never easy, and often nearly impossible.
Unless you can determine exactly how it was broken, and it was always broken in that exact same way, then it's going to be hard to 'undo' the damage.
If you want to try to undo the damage, your best bet would be to start writing some sample code, where you attempt numerous variations on calls to mb_convert_encoding() to see if you can find a combination of 'from' and 'to' that fixes your data. In the end, it's often best to not even bother worrying about fixing the old data because of the pain levels involved, but instead to just fix things going forward.
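A minimal sketch of that trial-and-error approach, assuming the mbstring extension is loaded. The function name try_encoding_repairs and the list of candidate encodings are illustrative only. It runs every from/to pair over a sample and keeps the results that differ from the input and are valid UTF-8, so you can inspect them by eye.

```php
<?php
// Hypothetical helper: try several mb_convert_encoding() from/to pairs on a
// sample string and collect the candidates that are valid UTF-8.
function try_encoding_repairs($sample, array $encodings)
{
    $results = array();
    foreach ($encodings as $from) {
        foreach ($encodings as $to) {
            if ($from === $to) {
                continue;
            }
            // Note the argument order: string, target encoding, source encoding.
            $candidate = mb_convert_encoding($sample, $to, $from);
            if ($candidate !== $sample && mb_check_encoding($candidate, 'UTF-8')) {
                $results["$from -> $to"] = $candidate;
            }
        }
    }
    return $results;
}

// "Fédération" whose "é" was encoded to UTF-8 twice (bytes C3 83 C2 A9).
$sample = "F\xC3\x83\xC2\xA9d\xC3\x83\xC2\xA9ration";
$encodings = array('UTF-8', 'ISO-8859-1', 'Windows-1252');
foreach (try_encoding_repairs($sample, $encodings) as $label => $text) {
    echo $label, ': ', $text, PHP_EOL;
}
```

Several pairs can produce valid UTF-8, including ones that make the damage worse, so the output is a shortlist for a human to judge rather than an automatic fix.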
However, before doing this, you need to make sure that you fix everything that is causing this issue in the first place. You've already mentioned that your DB table collation and editors are set properly. But there are more places where you need to check to make sure that everything is properly UTF-8:
The HTTP header sent by PHP:
header("Content-Type: text/html; charset=utf-8");
PHP's default charset:
ini_set("default_charset", 'utf-8');
The Apache configuration:
AddDefaultCharset UTF-8
When you call htmlspecialchars(), make sure that you include the appropriate 'utf-8' charset parameter at the end so that it doesn't encode characters incorrectly.
If you slip up on any one step in the whole process, the encoding can be mangled and problems arise. Once you get in the 'groove' of doing UTF-8, though, this all becomes second nature. And of course, PHP 6 is supposed to be fully Unicode compliant from the get-go, which will make lots of this easier (hopefully).
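For the htmlspecialchars() point, a minimal sketch of passing the charset explicitly. The flags are the second argument and the encoding is the third; the wrapper name e() is only an illustration.

```php
<?php
// Hypothetical wrapper so the UTF-8 charset argument is never forgotten.
function e($text)
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// "Pérez" written with explicit UTF-8 bytes so the example does not depend
// on how this source file is saved.
echo e("P\xC3\xA9rez & <b>Casas</b>"), PHP_EOL;
```

The accented character passes through untouched, while the ampersand and angle brackets are escaped.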
Upvotes: 66
Reputation: 455
In my case, I found out by using "mb_convert_encoding" that the previous encoding was ISO-8859-1 (which is latin1), then I fixed my problem with an SQL query:
UPDATE myDB.myTable SET myColumn = CAST(CAST(CONVERT(myColumn USING latin1) AS binary) AS CHAR)
However, the MySQL documentation states that the conversion may be lossy if the column contains characters that are not in both character sets.
Upvotes: 1
Reputation: 12897
This script had a nice approach. Converting it to the language of your choice should not be too difficult:
http://plasmasturm.org/log/416/
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode FB_QUIET );
binmode STDIN, ':bytes';
binmode STDOUT, ':encoding(UTF-8)';
my $out;
while ( <> ) {
    $out = '';
    while ( length ) {
        # consume input string up to the first UTF-8 decode error
        $out .= decode( "utf-8", $_, FB_QUIET );
        # consume one character; all octets are valid Latin-1
        $out .= decode( "iso-8859-1", substr( $_, 0, 1 ), FB_QUIET ) if length;
    }
    print $out;
}
Upvotes: 0
Reputation: 33432
If you apply utf8_encode() to a string that is already UTF-8, the result looks garbled, and it gets worse each time the string is encoded again.
I made a function, toUTF8(), that converts strings into UTF-8.
You don't need to specify what the encoding of your strings is. It can be Latin1 (iso 8859-1), Windows-1252 or UTF8, or a mix of these three.
I used this myself on a feed with mixed encodings in the same string.
Usage:
$utf8_string = Encoding::toUTF8($mixed_string);
$latin1_string = Encoding::toLatin1($mixed_string);
My other function, fixUTF8(), fixes garbled UTF-8 strings that were encoded into UTF-8 multiple times.
Usage:
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
Examples:
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
will output:
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Download:
https://github.com/neitanod/forceutf8
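To illustrate the idea of undoing repeated encoding, here is a minimal sketch. It is not the forceutf8 implementation; the function name undo_repeated_utf8_encoding and the $maxRounds limit are hypothetical. It assumes the mbstring extension and that each extra round came from treating the bytes as ISO-8859-1 (as utf8_encode() does), so it deliberately leaves alone any string containing characters above U+00FF.

```php
<?php
// Hypothetical helper: step back one encoding round at a time while the
// result is still valid UTF-8.
function undo_repeated_utf8_encoding($text, $maxRounds = 5)
{
    for ($i = 0; $i < $maxRounds; $i++) {
        if (!mb_check_encoding($text, 'UTF-8')) {
            break; // not UTF-8 at all, nothing to undo
        }
        // Only code points up to U+00FF can be the image of a single byte.
        if (preg_match('/[^\x{00}-\x{FF}]/u', $text) === 1) {
            break;
        }
        // Reverse one round: turn each code point back into one byte.
        $candidate = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
        if ($candidate === $text || !mb_check_encoding($candidate, 'UTF-8')) {
            break; // stepping back further would leave invalid UTF-8
        }
        $text = $candidate;
    }
    return $text;
}

// Simulate two extra rounds of encoding on "Fédération", then undo them.
$clean = "F\xC3\xA9d\xC3\xA9ration";
$twice = mb_convert_encoding(
    mb_convert_encoding($clean, 'UTF-8', 'ISO-8859-1'),
    'UTF-8',
    'ISO-8859-1'
);
var_dump(undo_repeated_utf8_encoding($twice) === $clean);
```

Text that went through Windows-1252 instead (the kind that shows up with € or ™ in it) contains characters above U+00FF and is skipped by this sketch. A string whose author really meant to write a sequence like "Ã©" would also be wrongly "repaired", so spot-check the results.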
Upvotes: 93
Reputation: 6825
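If you have double-encoded UTF-8 characters (various smart quotes, dashes, apostrophe ’, quotation mark “, etc.), in MySQL you can dump the data, then read it back in to fix the broken encoding. Dumping with the latin1 character set undoes one layer of encoding, and re-importing as utf8 reads the resulting bytes as proper UTF-8.
Like this: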
mysqldump -h DB_HOST -u DB_USER -pDB_PASSWORD --opt --quote-names \
    --skip-set-charset --default-character-set=latin1 DB_NAME > DB_NAME-dump.sql
mysql -h DB_HOST -u DB_USER -pDB_PASSWORD \
    --default-character-set=utf8 DB_NAME < DB_NAME-dump.sql
This was a 100% fix for my double encoded UTF-8.
Source: http://blog.hno3.org/2010/04/22/fixing-double-encoded-utf-8-data-in-mysql/
Upvotes: 97
Reputation: 7457
Another thing to check, which happened to be my solution, is how data is being returned from your server. In my application, I'm using PDO to connect from PHP to MySQL. I needed to add a flag to the connection telling it to return the data in UTF-8:
$dbHandle = new PDO("mysql:host=$dbHost;dbname=$dbName;charset=utf8", $dbUser, $dbPass,
array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES 'utf8'"));
Upvotes: 1
Reputation: 191
I had a problem with an XML file that had a broken encoding: it said it was UTF-8, but it had characters that were not UTF-8.
After several rounds of trial and error with mb_convert_encoding(), I managed to fix it with
mb_convert_encoding($text, 'Windows-1252', 'UTF-8')
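A minimal round-trip sketch of why that call can repair text, assuming the mbstring extension and that the damage came from UTF-8 bytes being misread as Windows-1252. The variable names are illustrative.

```php
<?php
// Right single quotation mark (U+2019) as UTF-8 bytes.
$original = "\xE2\x80\x99";

// Simulate the damage: the three UTF-8 bytes are misread as three
// Windows-1252 characters and stored again as UTF-8.
$garbled = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');

// The repair: map each character back to its Windows-1252 byte, which
// restores the original UTF-8 byte sequence.
$repaired = mb_convert_encoding($garbled, 'Windows-1252', 'UTF-8');

var_dump($repaired === $original);
var_dump(mb_check_encoding($repaired, 'UTF-8'));
```

Checking the result with mb_check_encoding() before saving it is a cheap safeguard. Sequences involving bytes that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) may not survive the round trip.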
Upvotes: 19
Reputation: 4197
I found a solution after days of search. My comment is going to be buried but anyway...
I get the corrupted data with php.
I don't use set names UTF8
I use utf8_decode() on my data
I update my database with my new decoded data, still not using set names UTF8
and voilà :)
Upvotes: 0
Reputation: 1042
I had the same problem a long time ago, and I fixed it using
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-15">
Upvotes: 0
Reputation: 27858
As Dan pointed out: you need to convert them to binary and then convert/correct the encoding.
E.g., for utf8 stored as latin1 the following SQL will fix it:
UPDATE table
SET field = CONVERT( CAST(field AS BINARY) USING utf8)
WHERE $broken_field_condition
Upvotes: 11
Reputation: 4375
I know this isn't very elegant, but after it was mentioned that the strings may be double encoded, I made this function:
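function fix_double_encoding($string)
{
    // Characters to look for in their double-encoded form.
    $utf8_chars = explode(' ', 'À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö');
    // Build the garbled form of each character by running it through utf8_encode() twice.
    $utf8_double_encoded = array();
    foreach ($utf8_chars as $utf8_char)
    {
        $utf8_double_encoded[] = utf8_encode(utf8_encode($utf8_char));
    }
    // Replace each garbled sequence with the intended character.
    $string = str_replace($utf8_double_encoded, $utf8_chars, $string);
    return $string;
}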
This seems to work perfectly to remove the double encoding I am experiencing. I am probably missing some of the characters that could be an issue to others. However, for my needs it is working perfectly.
Upvotes: 3
Reputation: 2084
It looks like your utf-8 is being interpreted as iso8859-1 or Win-1250 at some point.
When you say "In my database I have a few instances of bad encodings" - how did you check this? Through your app, phpmyadmin or the command line client? Are all utf-8 encodings showing up like this or only some? Is it possible you had the encodings wrong and it has been incorrectly converted from iso8859-1 to utf-8 when it was utf-8 already?
Upvotes: 0