Reputation: 3583
I have an XML file with mixed encoding: it is declared as ISO-8859-1 but also contains characters from Windows-1252 (trademark symbol, en dash, etc.).
I'm using PHP and XMLReader to parse the XML file and save it to a database. MySQL 5.0 stores the mixed-encoding characters as box characters, but MySQL 5.1 gives an error.
So the question is: what is the easiest and most foolproof method to save the data correctly as UTF-8?
This is my current code to convert it to UTF-8. I just wanted to know whether it may cause problems during conversion.
function cp1252_to_utf8($str)
{
    // utf8_encode() treats the input as ISO-8859-1, so bytes 0x80-0x9F come out
    // as C1 control characters ("\xc2\x80".."\xc2\x9f"). This map replaces them
    // with the characters Windows-1252 assigns to those bytes.
    $cp1252_map = array(
        "\xc2\x80" => "\xe2\x82\xac", /* EURO SIGN */
        "\xc2\x82" => "\xe2\x80\x9a", /* SINGLE LOW-9 QUOTATION MARK */
        "\xc2\x83" => "\xc6\x92",     /* LATIN SMALL LETTER F WITH HOOK */
        "\xc2\x84" => "\xe2\x80\x9e", /* DOUBLE LOW-9 QUOTATION MARK */
        "\xc2\x85" => "\xe2\x80\xa6", /* HORIZONTAL ELLIPSIS */
        "\xc2\x86" => "\xe2\x80\xa0", /* DAGGER */
        "\xc2\x87" => "\xe2\x80\xa1", /* DOUBLE DAGGER */
        "\xc2\x88" => "\xcb\x86",     /* MODIFIER LETTER CIRCUMFLEX ACCENT */
        "\xc2\x89" => "\xe2\x80\xb0", /* PER MILLE SIGN */
        "\xc2\x8a" => "\xc5\xa0",     /* LATIN CAPITAL LETTER S WITH CARON */
        "\xc2\x8b" => "\xe2\x80\xb9", /* SINGLE LEFT-POINTING ANGLE QUOTATION */
        "\xc2\x8c" => "\xc5\x92",     /* LATIN CAPITAL LIGATURE OE */
        "\xc2\x8e" => "\xc5\xbd",     /* LATIN CAPITAL LETTER Z WITH CARON */
        "\xc2\x91" => "\xe2\x80\x98", /* LEFT SINGLE QUOTATION MARK */
        "\xc2\x92" => "\xe2\x80\x99", /* RIGHT SINGLE QUOTATION MARK */
        "\xc2\x93" => "\xe2\x80\x9c", /* LEFT DOUBLE QUOTATION MARK */
        "\xc2\x94" => "\xe2\x80\x9d", /* RIGHT DOUBLE QUOTATION MARK */
        "\xc2\x95" => "\xe2\x80\xa2", /* BULLET */
        "\xc2\x96" => "\xe2\x80\x93", /* EN DASH */
        "\xc2\x97" => "\xe2\x80\x94", /* EM DASH */
        "\xc2\x98" => "\xcb\x9c",     /* SMALL TILDE */
        "\xc2\x99" => "\xe2\x84\xa2", /* TRADE MARK SIGN */
        "\xc2\x9a" => "\xc5\xa1",     /* LATIN SMALL LETTER S WITH CARON */
        "\xc2\x9b" => "\xe2\x80\xba", /* SINGLE RIGHT-POINTING ANGLE QUOTATION */
        "\xc2\x9c" => "\xc5\x93",     /* LATIN SMALL LIGATURE OE */
        "\xc2\x9e" => "\xc5\xbe",     /* LATIN SMALL LETTER Z WITH CARON */
        "\xc2\x9f" => "\xc5\xb8"      /* LATIN CAPITAL LETTER Y WITH DIAERESIS */
    );
    return strtr(utf8_encode($str), $cp1252_map);
}
$sql = 'SET NAMES "utf8" COLLATE "utf8_swedish_ci"';
mysql_query($sql);
$arr_book["booktitle"] = cp1252_to_utf8(iconv("UTF-8", "ISO-8859-1//TRANSLIT", $arr_book["booktitle"]));
Upvotes: 4
Views: 2037
Reputation: 70460
If you have mixed encodings in the same column, you have only one reasonable option: store the data as binary rather than in a specific charset. If the file is actually in cp1252, though (which overlaps with ISO-8859-1 to a large extent, so you can probably just claim cp1252 as the input encoding), just call the iconv function on it before loading it as XML: $utf8string = iconv('cp1252', 'utf-8', $string);
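As a rough sketch of that suggestion, assuming the whole file fits in memory and really is cp1252 throughout: the helper name xml_cp1252_to_utf8 is hypothetical, and rewriting the XML declaration is my own addition, needed because the parser would otherwise still trust the old encoding="ISO-8859-1" label after the bytes have been converted.

```php
<?php
// Hypothetical helper (not from the question): treat a document that is
// labelled ISO-8859-1 as Windows-1252 and return it as UTF-8.
function xml_cp1252_to_utf8($xml)
{
    // Convert the raw bytes. iconv() returns false if it hits a byte it
    // cannot convert.
    $utf8 = iconv('CP1252', 'UTF-8', $xml);
    if ($utf8 === false) {
        throw new RuntimeException('Input could not be converted from CP1252');
    }

    // Make the XML declaration agree with the new byte encoding, so the
    // parser does not reinterpret the UTF-8 bytes as ISO-8859-1.
    return preg_replace(
        '/^(<\?xml[^>]*encoding=["\'])[^"\']+/i',
        '${1}UTF-8',
        $utf8,
        1
    );
}
```

The result can then be handed to XMLReader as a string (for example via $reader->XML($converted)) instead of opening the file directly. One caveat: cp1252 leaves a few bytes (such as 0x81 and 0x8D) undefined, and some iconv implementations may refuse input that contains them, so the failure branch is worth keeping.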
Upvotes: 1