Reputation: 7
I'm struggling to find the right encoding for inserting a UTF-8 CSV dataset into a database.
In my DB, all the text fields use the utf8mb4_unicode_520_ci collation (this is how my WordPress is configured, so I can't really change that). So I assume it's a kind of UTF-8 encoding.
I'm applying the following conversion to every field. Without it, all inserts contained strange characters. With it, all the fields look good.
$row_data[$key] = mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');
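For reference, the second argument of `mb_convert_encoding` is the target encoding and the third is the source, so the line above converts from UTF-8 to ISO-8859-1. A minimal sketch that makes this visible in the bytes, assuming the script file is saved as UTF-8 (`utf8_to_latin1` is a hypothetical helper name, not a built-in):

```php
<?php
// Hypothetical helper wrapping the call used above:
// mb_convert_encoding(string, to_encoding, from_encoding)
function utf8_to_latin1(string $value): string
{
    return mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');
}

$value  = "Rácz";                 // 5 bytes in UTF-8 (á takes two bytes)
$latin1 = utf8_to_latin1($value); // 4 bytes in ISO-8859-1 (á takes one byte)

echo strlen($value), ' -> ', strlen($latin1), "\n";
echo bin2hex($value), ' -> ', bin2hex($latin1), "\n";
```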
... except for two fields. These two fields are collected in the same CSV but from another source (another web site) so I think for some fields in the CSV, the encoding may be different.
Here is an example with sample data that does not insert correctly into the DB.
<?php
$text = "Gergő Rácz";
// Detect the encoding
$encoding = mb_detect_encoding($text);
echo "encoding detected: " . $encoding;
$utf8_text = mb_convert_encoding($text, 'ISO-8859-1','UTF-8');
echo "\ntext to UTF-8 : " . $utf8_text;
# php ./p.php
encoding detected: UTF-8
text to UTF-8 : Gerg▒? Rácz
It's like it's already UTF-8 but not really. And I can't identify which encoding it is. Garbage characters in, garbage out.
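One way to narrow this down is to validate the bytes against UTF-8 and look at them in hex. `mb_detect_encoding` only guesses from a list of candidates, whereas `mb_check_encoding` tests one specific encoding. A minimal sketch, assuming the script file itself is saved as UTF-8 (`describe_bytes` is a hypothetical helper name, not a built-in):

```php
<?php
// Hypothetical helper: says whether the string is valid UTF-8 and dumps its raw bytes.
function describe_bytes(string $text): string
{
    $valid = mb_check_encoding($text, 'UTF-8') ? 'valid UTF-8' : 'not valid UTF-8';
    return $valid . ' | hex: ' . bin2hex($text);
}

$text = "Gergő Rácz"; // assumes this file is saved as UTF-8
echo describe_bytes($text), "\n";
```

In valid UTF-8, ő shows up as the two bytes `c5 91` and á as `c3 a1`. A single byte in their place would suggest the field arrived in a legacy single-byte encoding.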
Any ideas?
Many thanks!
Upvotes: -1
Views: 61
Reputation: 30103
ő (U+0151, Latin Small Letter O With Double Acute) isn't present in ISO-8859-1. Use Windows-1252 instead, as follows:
<?php
$text = "Gergő Rácz";
// Détection de l'encodage
$encoding = mb_detect_encoding($text);
echo "\nencoding detected: " . $encoding;
$utf8_text = mb_convert_encoding($text, 'Windows-1252','UTF-8');
echo "\ntext to UTF-8 : " . $utf8_text;
?>
Output: php .\SO\78678820.php
encoding detected: UTF-8
text to UTF-8 : Gergő Rácz
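Whether a target encoding can hold every character of a string can be checked with a round trip: convert to the target, convert back, and compare with the original. A minimal sketch, assuming the script file is saved as UTF-8; `survives_round_trip` is a hypothetical helper name, and the candidate list is only an illustration (ő belongs to the Latin-2 repertoire, ISO-8859-2). Characters missing from the target are replaced by mbstring's substitute character (`?` by default), so the comparison fails for them.

```php
<?php
// Hypothetical helper: true when every character of $utf8 exists in $target,
// i.e. converting there and back returns the original string unchanged.
function survives_round_trip(string $utf8, string $target): bool
{
    $encoded = mb_convert_encoding($utf8, $target, 'UTF-8');
    $decoded = mb_convert_encoding($encoded, 'UTF-8', $target);
    return $decoded === $utf8;
}

$text = "Gergő Rácz"; // assumes this file is saved as UTF-8

// Illustrative candidates only; extend the list as needed.
foreach (['ISO-8859-1', 'Windows-1252', 'ISO-8859-2'] as $target) {
    echo $target, ': ', survives_round_trip($text, $target) ? 'ok' : 'lossy', "\n";
}
```

Running this on the real CSV values before inserting them shows which fields would lose characters in a given single-byte target.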
Upvotes: 0