Ken Addams

Reputation: 7

mb_detect_encoding seems to detect UTF8 but decoded string still shows weird characters

I'm struggling to detect the right encoding so that I can insert a UTF-8 CSV dataset into a database.

In my DB, all the text fields use the utf8mb4_unicode_520_ci collation (this is how my WordPress is configured, so I can't really change that). So I assume the data is stored in some form of UTF-8.

I run every field through the conversion below. Without it, every inserted value contained strange characters. With it, all the fields look good.

$row_data[$key] = mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');

... except for two fields. These two fields are in the same CSV but come from another source (another website), so I think their encoding may be different from the rest.

Here is an example with sample data that does not insert correctly into the DB.

<?php

$text = "Gergő Rácz";

// Detect the encoding
$encoding = mb_detect_encoding($text);

echo "encoding detected: " . $encoding;
$utf8_text = mb_convert_encoding($text, 'ISO-8859-1','UTF-8');

echo "\ntext to UTF-8 : " . $utf8_text;

# php ./p.php
encoding detected: UTF-8
text to UTF-8 : Gerg▒? Rácz

It's like it's already UTF-8 but not really. And I can't identify which encoding it is. Garbage characters in, garbage out.

Any ideas?

Many thanks!

Upvotes: -1

Views: 61

Answers (1)

JosefZ

Reputation: 30103

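ő (U+0151, Latin Small Letter O With Double Acute) isn't present in ISO-8859-1, so it cannot survive a conversion to that charset. It isn't present in Windows-1252 either. Among the single-byte encodings that mbstring supports, the Central European ISO-8859-2 (Latin-2) does contain it, so use that as the target instead:

<?php

$text = "Gergő Rácz";

// Detect the encoding
$encoding = mb_detect_encoding($text);

echo "\nencoding detected: " . $encoding;
// Argument order is (string, to, from): this converts FROM UTF-8 TO ISO-8859-2,
// which contains both ő and á
$utf8_text = mb_convert_encoding($text, 'ISO-8859-2','UTF-8');

echo "\ntext to UTF-8 : " . $utf8_text;

?>

Output of php .\SO\78678820.php, as displayed on a console whose code page matches the target encoding: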

encoding detected: UTF-8
text to UTF-8 : Gergő Rácz
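
To check which target charsets can actually hold a given string, one option is a round-trip test: convert from UTF-8 to the candidate charset and back, then compare with the original. The sketch below is only an illustration. The helper name survives_round_trip is hypothetical (it is not part of mbstring), and the code assumes the source file is saved as UTF-8 and that the mbstring extension is loaded.

```php
<?php

// Hypothetical helper (not part of mbstring): returns true when every
// character of a UTF-8 string survives a conversion to $charset and back.
function survives_round_trip(string $utf8, string $charset): bool
{
    // Characters missing from $charset are replaced by the substitute
    // character (by default "?"), so the round trip no longer matches.
    $encoded = mb_convert_encoding($utf8, $charset, 'UTF-8');
    $decoded = mb_convert_encoding($encoded, 'UTF-8', $charset);

    return $decoded === $utf8;
}

$text = "Gergő Rácz";

// Candidate single-byte charsets to test against the sample string.
foreach (['ISO-8859-1', 'Windows-1252', 'ISO-8859-2'] as $charset) {
    $verdict = survives_round_trip($text, $charset) ? 'lossless' : 'lossy';
    echo $charset . ': ' . $verdict . "\n";
}
```

For the sample string, only ISO-8859-2 should report lossless, because neither ISO-8859-1 nor Windows-1252 contains ő. If a field is already valid UTF-8 and the column is utf8mb4, it may not need any conversion at all. That depends on the character set of the database connection, which the question does not show.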

Upvotes: 0
