Reputation: 48
TLDR: Trying to convert the string \u00e6\u0097\u00a5\u00e6\u009c\u00ac to 日本 in PHP.
(Trying to get \u00e6\u0097\u00a5\u00e6\u009c\u00ac to echo out 日本.)
Hi folks,
I have a JSON file from Instagram (a download of my data), and many of my posts contain Japanese text, which is stored encoded in UTF-8 (please correct me if I'm wrong, especially as mb_detect_encoding("\u00e6\u0097\u00a5\u00e6\u009c\u00ac") returns "ASCII").
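The "ASCII" result is what one would expect if the string holds the escape sequences as literal text rather than as encoded bytes. A small sketch of that reading, assuming $str was assigned with single quotes so the backslashes are kept as-is:

```php
<?php
// Assumption: $str was assigned with single quotes, so it holds the escapes as literal text.
$str = '\u00e6\u0097\u00a5\u00e6\u009c\u00ac';

// 6 escapes of 6 characters each: backslash, "u", four hex digits.
echo strlen($str), PHP_EOL;                          // 36

// No byte is above 0x7F, so the text is plain ASCII.
$hasNonAscii = preg_match('/[^\x00-\x7F]/', $str) === 1;
var_dump($hasNonAscii);                              // bool(false)

// The wanted text is 6 raw bytes: the UTF-8 encoding of two characters.
$target = "\xE6\x97\xA5\xE6\x9C\xAC";
echo strlen($target), PHP_EOL;                       // 6
```

Under that assumption, no encoding conversion can help by itself, because the escapes first have to be interpreted.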
For example, \u00e6\u0097\u00a5\u00e6\u009c\u00ac becomes 日本.
The conversion can be seen working fine on this encoder/decoder website: https://mothereff.in/utf-8
(Note that if you put 日本 into the above site it returns \xE6\x97\xA5\xE6\x9C\xAC, so adding \xE6\x97\xA5\xE6\x9C\xAC \u00e6\u0097\u00a5\u00e6\u009c\u00ac to the encoded field will produce 日本 日本 in the decoded field.)
I'm trying to convert it back to regular Japanese text but am having issues.
I've been googling and looking over Stack Overflow for just over a day and have tried many different methods, but I just can't get it to convert. I'm clearly missing something. In most cases the string does not change at all.
For the scope of this question, I'm simply trying to convert \u00e6\u0097\u00a5\u00e6\u009c\u00ac into 日本.
I am not trying to convert the JSON file (though I am open to any suggestions that would require it).
(For the record, I am using the variable $str for \u00e6\u0097\u00a5\u00e6\u009c\u00ac.)
The following attempts resulted in no visible change; the output was still \u00e6\u0097\u00a5\u00e6\u009c\u00ac:
echo call_user_func_array('mb_convert_encoding', array(&$str,'HTML-ENTITIES','UTF-8'));
echo iconv('ASCII', 'UTF-8', $str);
echo iconv("UTF-8", "CP1252", $str);
echo iconv('UTF-8', 'ISO-8859-1', $str);
echo iconv('UTF-8', 'UTF-8//IGNORE', utf8_encode($str));
echo iconv('ISO-8859-1', 'UTF-8', $str);
echo iconv('ISO-8859-9', 'UTF-8', $str);
echo iconv(mb_detect_encoding($str, mb_detect_order(), true), "UTF-8", $str);
echo htmlentities($str);
echo mb_convert_encoding($str, 'utf-8', 'iso-8859-1');
echo mb_convert_encoding($str, "EUC-JP", "auto");
echo mb_convert_encoding($str, "utf-8", "windows-1251");
echo mb_convert_encoding($str, "windows-1251", "utf-8");
echo mb_convert_encoding($str,'HTML-ENTITIES', 'UTF-8');
echo mb_convert_encoding($str,"UTF-8","auto");
echo mb_convert_encoding($str,"UTF-8");
echo mb_convert_encoding($str, 'UTF-8', array('EUC-JP', 'SHIFT-JIS', 'AUTO'));
echo mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
echo mb_convert_encoding($str, "UTF-8", mb_detect_encoding($str, "UTF-8, ISO-8859-1, ISO-8859-15", true));
echo mb_convert_encoding($str, "ISO-8859-1", mb_detect_encoding($str, "UTF-8, ISO-8859-1, ISO-8859-15", true));
echo mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8');
echo utf8_decode($str);
echo utf8_encode($str);
The following attempts resulted in the backslashes being doubled and double quotation marks being added: "\\u00e6\\u0097\\u00a5\\u00e6\\u009c\\u00ac"
echo json_encode($str,JSON_HEX_TAG);
echo json_encode($str,JSON_UNESCAPED_UNICODE |JSON_PRETTY_PRINT);
echo json_encode($str,JSON_UNESCAPED_UNICODE|JSON_UNESCAPED_SLASHES);
The following attempts resulted in nothing being output:
echo json_decode($str, JSON_HEX_TAG);
echo json_decode($str, false);
echo json_decode($str, false, 512, JSON_UNESCAPED_UNICODE);
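A likely reason these calls print nothing is that the bare string is not a valid JSON document, so json_decode returns null. The sketch below, assuming the same single-quoted $str, wraps it in double quotes so that it parses as a JSON string. The result is still not 日本, because each \u00XX escape decodes to a code point rather than to a byte.

```php
<?php
$str = '\u00e6\u0097\u00a5\u00e6\u009c\u00ac';

// Not valid JSON on its own: json_decode returns null and records an error.
$bare = json_decode($str);
var_dump($bare);                      // NULL
echo json_last_error_msg(), PHP_EOL;

// Wrapped in double quotes, the text is a valid JSON string literal.
$decoded = json_decode('"' . $str . '"');

// Six code points (U+00E6, U+0097, ...), each stored as two UTF-8 bytes.
echo bin2hex($decoded), PHP_EOL;      // c3a6c297c2a5c3a6c29cc2ac
```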
The following attempt resulted in the backslashes changing to an unknown character: �_u00e6�_u0097�_u00a5�_u00e6�_u009c�_u00ac
echo mb_convert_encoding($str, "SJIS");
From the PHP documentation I tried this to see if any of the combinations would work, but none did. https://www.php.net/manual/en/function.mb-convert-encoding.php#97902
foreach(mb_list_encodings() as $chr){
echo mb_convert_encoding($str, 'UTF-8', $chr)." : ".$chr."<br>";
}
echo "<br>--- REVERSE TRY ---<br><br>";
foreach(mb_list_encodings() as $chr){
echo mb_convert_encoding($str, $chr, 'UTF-8')." : ".$chr."<br>";
}
I tried using the Unicode codepoint escape syntax, which gave æ¥æ¬ rather than 日本:
https://www.php.net/manual/en/migration70.new-features.php#migration70.new-features.unicode-codepoint-escape-syntax
echo "\u{00e6}\u{0097}\u{00a5}\u{00e6}\u{009c}\u{00ac}";
As mentioned in the brackets earlier, \xE6\x97\xA5\xE6\x9C\xAC does convert to 日本 when echoed.
echo "\xE6\x97\xA5\xE6\x9C\xAC";
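The difference between the two escape styles shows up in the bytes they produce. A short sketch using bin2hex (the variable names are only for illustration):

```php
<?php
// "\u{00e6}" is the code point U+00E6, which PHP stores as two UTF-8 bytes.
$codePoint = "\u{00e6}";
echo bin2hex($codePoint), PHP_EOL;    // c3a6

// "\xE6" is the single raw byte 0xE6.
$rawByte = "\xE6";
echo bin2hex($rawByte), PHP_EOL;      // e6

// Three raw bytes together form the UTF-8 encoding of U+65E5.
$sameCharacter = "\xE6\x97\xA5" === "\u{65e5}";
var_dump($sameCharacter);             // bool(true)
```

So the hex digits in the data describe bytes, while the \u{...} syntax treats the same digits as code points.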
Noticing that the two different codes had the same endings, I tried using str_replace so that they would match, but this time the literal text \xE6\x97\xA5\xE6\x9C\xAC was echoed.
echo str_replace("\U00","\x",strtoupper($str));
I have also tried all of the above with and without the following:
header('Content-Type: text/plain; charset="UTF-8"');
Here is a segment of the original JSON file (original file is 13k lines, so here is a single element).
{
"media": [
{
"uri": "media/posts/202104/175127092_241529264421003_4026764649651789139_n_18106766305234668.jpg",
"creation_timestamp": 1619277565,
"title": "Time to head back to Tokyo.\nFukuoka Airport, Japan.\n18 October 2020\n.\n.\n.\n.\n.\n#japan #\u00e6\u0097\u00a5\u00e6\u009c\u00ac #toyphotography #toy #\u00e3\u0081\u008a\u00e3\u0082\u0082\u00e3\u0081\u00a1\u00e3\u0082\u0083 #\u00e3\u0083\u00ad\u00e3\u0083\u009b\u00e3\u0082\u0099\u00e3\u0083\u0083\u00e3\u0083\u0088 #GodJesusRobot #robot #toyholiday #holiday #vacation #\u00e6\u0097\u0085\u00e8\u00a1\u008c #photography #\u00e5\u0086\u0099\u00e7\u009c\u009f #japan_of_insta #japantravel #\u00e6\u0097\u00a5\u00e6\u009c\u00ac\u00e6\u0097\u0085\u00e8\u00a1\u008c #travel #kitakyushu #\u00e5\u008c\u0097\u00e4\u00b9\u009d\u00e5\u00b7\u009e #airport #\u00e7\u00a9\u00ba\u00e6\u00b8\u00af #fukuokaairport #\u00e7\u00a6\u008f\u00e5\u00b2\u00a1\u00e7\u00a9\u00ba\u00e6\u00b8\u00af #plane #airplane #aeroplane #\u00e9\u00a3\u009b\u00e8\u00a1\u008c\u00e6\u00a9\u009f #windowseat #window"
}
]
}
UPDATE
Based on the comments by @jerry and @yourcommonsense, hex2bin can work, provided the string is first converted by dropping each \u00. hex2bin(str_replace('\u00', '', $str)); does work for the string mentioned in the TLDR and the upper part of the question, but to tackle the full title string in the JSON I've come up with a very ugly and messy method.
$str = "Time to head back to Tokyo.\nFukuoka Airport, Japan.\n18 October 2020\n.\n.\n.\n.\n.\n#japan #\u00e6\u0097\u00a5\u00e6\u009c\u00ac #toyphotography #toy #\u00e3\u0081\u008a\u00e3\u0082\u0082\u00e3\u0081\u00a1\u00e3\u0082\u0083 #\u00e3\u0083\u00ad\u00e3\u0083\u009b\u00e3\u0082\u0099\u00e3\u0083\u0083\u00e3\u0083\u0088 #GodJesusRobot #robot #toyholiday #holiday #vacation #\u00e6\u0097\u0085\u00e8\u00a1\u008c #photography #\u00e5\u0086\u0099\u00e7\u009c\u009f #japan_of_insta #japantravel #\u00e6\u0097\u00a5\u00e6\u009c\u00ac\u00e6\u0097\u0085\u00e8\u00a1\u008c #travel #kitakyushu #\u00e5\u008c\u0097\u00e4\u00b9\u009d\u00e5\u00b7\u009e #airport #\u00e7\u00a9\u00ba\u00e6\u00b8\u00af #fukuokaairport #\u00e7\u00a6\u008f\u00e5\u00b2\u00a1\u00e7\u00a9\u00ba\u00e6\u00b8\u00af #plane #airplane #aeroplane #\u00e9\u00a3\u009b\u00e8\u00a1\u008c\u00e6\u00a9\u009f #windowseat #window";
$pattern = '/(\\\\u00..)+/i';
function getHex2Bin($matches) {
return hex2bin(str_replace("\U00","",strtoupper($matches[0])));
}
$result = preg_replace_callback($pattern, 'getHex2Bin', $str);
echo $result;
This does work, giving me my desired result:
Time to head back to Tokyo. Fukuoka Airport, Japan. 18 October 2020 . . . . . #japan #日本 #toyphotography #toy #おもちゃ #ロボット #GodJesusRobot #robot #toyholiday #holiday #vacation #旅行 #photography #写真 #japan_of_insta #japantravel #日本旅行 #travel #kitakyushu #北九州 #airport #空港 #fukuokaairport #福岡空港 #plane #airplane #aeroplane #飛行機 #windowseat #window
but I can't help feeling that there is a much neater solution.
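One possibly tidier variant of the same idea, offered as a sketch rather than a definitive answer, is to match each \u00XX escape individually and turn its last two hex digits into one byte with chr(hexdec(...)). The function name decodeByteEscapes is hypothetical, and the approach assumes every escape in the text has the \u00XX form, as in this export; an escape such as \u65e5 would be left untouched.

```php
<?php
// Hypothetical helper: converts literal \u00XX escapes into single bytes.
function decodeByteEscapes(string $text): string
{
    return preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})/i',
        // $m[1] holds the two hex digits of the byte.
        fn (array $m): string => chr(hexdec($m[1])),
        $text
    );
}

$str = '#japan #\u00e6\u0097\u00a5\u00e6\u009c\u00ac #toy';
$result = decodeByteEscapes($str);
echo $result, PHP_EOL; // #japan #日本 #toy
```

Text outside the escapes passes through unchanged, so the same call should handle the full title string.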
Update 2
Here is a PHP Sandbox showing the results of all attempts mentioned above, including the messy working one.
Upvotes: 1
Views: 1279
Reputation: 30173
You are facing a case of mojibake (example in Python, since it is widely understood):
print('\u00e6\u0097\u00a5\u00e6\u009c\u00ac'.encode('latin1').decode())
日本
Let's rewrite the above code in PHP terms (using the json_decode function):
<?php
$str_chin = "日本";
echo $str_chin . " => test: chinese to JSON => "
    . json_encode($str_chin) . PHP_EOL;
// Single quotes keep the \u00XX escapes as literal text.
$str = '\u00e6\u0097\u00a5\u00e6\u009c\u00ac';
// Wrap in double quotes so the text is a valid JSON string literal.
$str_moj = json_decode('"' . $str . '"');
echo $str . " => mojibake => "
    . $str_moj . PHP_EOL;
// Map each code point U+0000..U+00FF back to a single byte.
echo $str . " => solution => "
    . mb_convert_encoding($str_moj, 'iso-8859-1', 'utf-8');
?>
Output:
73099438.php
日本 => test: chinese to JSON => "\u65e5\u672c"
\u00e6\u0097\u00a5\u00e6\u009c\u00ac => mojibake => æ¥æ¬
\u00e6\u0097\u00a5\u00e6\u009c\u00ac => solution => 日本
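The same fix can be applied to a whole decoded export instead of a single string. The following is a minimal sketch, assuming the file has the structure shown in the question; fixMojibake is a hypothetical helper name, and the inline $json stands in for the real file contents.

```php
<?php
// Hypothetical helper: walks decoded JSON and repairs every string value.
function fixMojibake($value)
{
    if (is_array($value)) {
        return array_map('fixMojibake', $value);
    }
    if (is_string($value)) {
        // Map each code point U+0000..U+00FF back to the single byte it stands for.
        return mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');
    }
    return $value; // numbers, booleans, null
}

// Stand-in for file_get_contents() on the real export (assumption).
$json = '{"media":[{"creation_timestamp":1619277565,"title":"#japan #\u00e6\u0097\u00a5\u00e6\u009c\u00ac"}]}';

$data = fixMojibake(json_decode($json, true));
echo $data['media'][0]['title'], PHP_EOL; // #japan #日本
```

This only suits data in which every string is mis-encoded this way: a string that already holds correctly decoded characters outside Latin-1 would be damaged by the conversion. Array keys are left as they are.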
Upvotes: 2