Trouble decoding string from JSON in PHP \u00e6\u0097\u00a5\u00e6\u009c\u00ac

Question

TLDR: I'm trying to convert the string \u00e6\u0097\u00a5\u00e6\u009c\u00ac to 日本 in PHP, i.e. to get \u00e6\u0097\u00a5\u00e6\u009c\u00ac to echo out as 日本.

Hi folks,

I have a json file from Instagram (downloaded my data) and many of my posts contain Japanese text which is stored encoded in UTF-8 (and please correct me if I'm wrong, especially as mb_detect_encoding("\u00e6\u0097\u00a5\u00e6\u009c\u00ac") returns "ASCII").
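A note on the "ASCII" result: as a PHP string, the value consists of literal backslash, u and hex-digit characters, all of which are plain ASCII, so there is nothing for an encoding converter to change. A minimal sketch, assuming the value is held in a single-quoted literal and the mbstring extension is loaded (the variable names are illustrative):

```php
<?php
// The escape sequences are stored as literal text: backslash, 'u', four hex digits.
$str = '\u00e6\u0097\u00a5\u00e6\u009c\u00ac';

// 6 escapes x 6 characters each = 36 bytes, every one of them below 0x80.
$length  = strlen($str);
$isAscii = mb_check_encoding($str, 'ASCII');

echo $length, PHP_EOL;   // 36
var_dump($isAscii);      // bool(true)
```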

For example \u00e6\u0097\u00a5\u00e6\u009c\u00ac becomes 日本.

The conversions can be seen working fine on this encoder/decoder website: https://mothereff.in/utf-8

(Note that if you put 日本 into the above site it returns \xE6\x97\xA5\xE6\x9C\xAC, so adding \xE6\x97\xA5\xE6\x9C\xAC \u00e6\u0097\u00a5\u00e6\u009c\u00ac to the encoded field will produce 日本日本 in the decoded field)

I'm trying to convert it back to regular Japanese text but am having issues.

I've been googling and looking over Stack Overflow for just over a day and have tried many different methods, but I just can't get it to convert. I'm clearly missing something. In most cases the string does not change at all.

For the scope of this question, I'm simply trying to convert \u00e6\u0097\u00a5\u00e6\u009c\u00ac into 日本. I am not trying to convert the json file (though am open to any suggestions that would need me to).

(For the record I am using the variable $str for \u00e6\u0097\u00a5\u00e6\u009c\u00ac)

The following attempts resulted in no visible change, \u00e6\u0097\u00a5\u00e6\u009c\u00ac

echo call_user_func_array('mb_convert_encoding', array(&$str,'HTML-ENTITIES','UTF-8'));
echo iconv('ASCII', 'UTF-8', $str);
echo iconv("UTF-8", "CP1252", $str);
echo iconv('UTF-8', 'ISO-8859-1', $str);
echo iconv('UTF-8', 'UTF-8//IGNORE', utf8_encode($str));
echo iconv('ISO-8859-1', 'UTF-8', $str);
echo iconv('ISO-8859-9', 'UTF-8', $str);
echo iconv(mb_detect_encoding($str, mb_detect_order(), true), "UTF-8", $str);
echo htmlentities($str);
echo mb_convert_encoding($str, 'utf-8', 'iso-8859-1');
echo mb_convert_encoding($str, "EUC-JP", "auto");
echo mb_convert_encoding($str, "utf-8", "windows-1251");
echo mb_convert_encoding($str, "windows-1251", "utf-8");
echo mb_convert_encoding($str,'HTML-ENTITIES', 'UTF-8');
echo mb_convert_encoding($str,"UTF-8","auto");
echo mb_convert_encoding($str,"UTF-8");
echo mb_convert_encoding($str, 'UTF-8', array('EUC-JP', 'SHIFT-JIS', 'AUTO'));
echo mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
echo mb_convert_encoding($str, "UTF-8", mb_detect_encoding($str, "UTF-8, ISO-8859-1, ISO-8859-15", true));
echo mb_convert_encoding($str, "ISO-8859-1", mb_detect_encoding($str, "UTF-8, ISO-8859-1, ISO-8859-15", true));
echo mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8');
echo utf8_decode($str);
echo utf8_encode($str);

The following attempts resulted in the backslashes being doubled and double quotation marks being added: "\\u00e6\\u0097\\u00a5\\u00e6\\u009c\\u00ac"

echo json_encode($str,JSON_HEX_TAG);
echo json_encode($str,JSON_UNESCAPED_UNICODE |JSON_PRETTY_PRINT);
echo json_encode($str,JSON_UNESCAPED_UNICODE|JSON_UNESCAPED_SLASHES);

The following attempts resulted in nothing being returned:

echo json_decode($str, JSON_HEX_TAG);
echo json_decode($str, false);
echo json_decode($str, false, 512, JSON_UNESCAPED_UNICODE);
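A likely reason these calls return nothing: a bare run of \uXXXX escapes is not a valid JSON document, so json_decode fails and returns null. Wrapped in double quotes the same text is a valid JSON string literal. The sketch below shows both cases; note that the decoded value is still not 日本 but six separate code points (U+00E6, U+0097, ...), stored as 12 UTF-8 bytes. Variable names are illustrative.

```php
<?php
$str = '\u00e6\u0097\u00a5\u00e6\u009c\u00ac';

// A bare \uXXXX run is not a JSON document, so decoding fails and returns null.
$bare  = json_decode($str);
$error = json_last_error_msg();

// Wrapped in double quotes it becomes a valid JSON string literal.
$decoded = json_decode('"' . $str . '"');

var_dump($bare);                 // NULL
echo $error, PHP_EOL;            // the parser's error message
echo bin2hex($decoded), PHP_EOL; // c3a6c297c2a5c3a6c29cc2ac
```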

The following attempt resulted in the backslashes changing to an unknown character: �_u00e6�_u0097�_u00a5�_u00e6�_u009c�_u00ac

echo mb_convert_encoding($str, "SJIS");

From the PHP documentation I tried this to see if any of the combinations would work, but none did. https://www.php.net/manual/en/function.mb-convert-encoding.php#97902

foreach(mb_list_encodings() as $chr){
    echo mb_convert_encoding($str, 'UTF-8', $chr)." : ".$chr."\n";
}
echo "\n--- REVERSE TRY ---\n\n";
foreach(mb_list_encodings() as $chr){
    echo mb_convert_encoding($str, $chr, 'UTF-8')." : ".$chr."\n";
}

I tried using the Unicode Codepoint Escape Syntax, which gave æ—¥æœ¬ https://www.php.net/manual/en/migration70.new-features.php#migration70.new-features.unicode-codepoint-escape-syntax

echo "\u{00e6}\u{0097}\u{00a5}\u{00e6}\u{009c}\u{00ac}";

As mentioned in the brackets earlier, \xE6\x97\xA5\xE6\x9C\xAC does convert to 日本 when echoed.

echo "\xE6\x97\xA5\xE6\x9C\xAC";
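The difference between the two literals can be made visible with bin2hex. "\u{00e6}" names the code point U+00E6, which PHP stores as two UTF-8 bytes, whereas "\xE6" is one raw byte. A small sketch (variable names are illustrative):

```php
<?php
// "\u{00e6}" means code point U+00E6, which PHP stores as the two UTF-8 bytes C3 A6.
$codePoints = "\u{00e6}\u{0097}\u{00a5}\u{00e6}\u{009c}\u{00ac}";

// "\xE6" means the single raw byte E6.
$rawBytes = "\xE6\x97\xA5\xE6\x9C\xAC";

echo bin2hex($codePoints), PHP_EOL; // c3a6c297c2a5c3a6c29cc2ac
echo bin2hex($rawBytes), PHP_EOL;   // e697a5e69cac
echo $rawBytes, PHP_EOL;            // 日本 when the output is read as UTF-8
```

So the six hex pairs are the right bytes; they only have to be treated as bytes rather than as code points.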

Noticing above that the two different codes had the same endings, I tried using str_replace so that they would match, but this time the literal text \xE6\x97\xA5\xE6\x9C\xAC was echoed.

echo str_replace("\U00","\x",strtoupper($str));

I have also tried all of the above with and without the following:

header('Content-Type: text/plain; charset="UTF-8"');

Here is a segment of the original JSON file (original file is 13k lines, so here is a single element).

{
    "media": [
      {
        "uri": "media/posts/202104/175127092_241529264421003_4026764649651789139_n_18106766305234668.jpg",
        "creation_timestamp": 1619277565,
        "title": "Time to head back to Tokyo.\nFukuoka Airport, Japan.\n18 October 2020\n.\n.\n.\n.\n.\n#japan #\u00e6\u0097\u00a5\u00e6\u009c\u00ac #toyphotography #toy #\u00e3\u0081\u008a\u00e3\u0082\u0082\u00e3\u0081\u00a1\u00e3\u0082\u0083 #\u00e3\u0083\u00ad\u00e3\u0083\u009b\u00e3\u0082\u0099\u00e3\u0083\u0083\u00e3\u0083\u0088  #GodJesusRobot #robot #toyholiday  #holiday #vacation #\u00e6\u0097\u0085\u00e8\u00a1\u008c #photography  #\u00e5\u0086\u0099\u00e7\u009c\u009f #japan_of_insta #japantravel #\u00e6\u0097\u00a5\u00e6\u009c\u00ac\u00e6\u0097\u0085\u00e8\u00a1\u008c #travel #kitakyushu #\u00e5\u008c\u0097\u00e4\u00b9\u009d\u00e5\u00b7\u009e #airport #\u00e7\u00a9\u00ba\u00e6\u00b8\u00af #fukuokaairport #\u00e7\u00a6\u008f\u00e5\u00b2\u00a1\u00e7\u00a9\u00ba\u00e6\u00b8\u00af #plane #airplane #aeroplane #\u00e9\u00a3\u009b\u00e8\u00a1\u008c\u00e6\u00a9\u009f #windowseat #window"
      }
    ]
  }

UPDATE

Based on the comments by @jerry and @yourcommonsense, hex2bin can work once the \u00 prefixes are dropped from the string. hex2bin(str_replace('\u00', '', $str)); will definitely work for the string mentioned in the TLDR and the upper part of the question, but to tackle the full title string in the JSON I've come up with a very ugly and messy method.

$str = "Time to head back to Tokyo.
Fukuoka Airport, Japan.
18 October 2020
.
.
.
.
.
#japan #\u00e6\u0097\u00a5\u00e6\u009c\u00ac #toyphotography #toy #\u00e3\u0081\u008a\u00e3\u0082\u0082\u00e3\u0081\u00a1\u00e3\u0082\u0083 #\u00e3\u0083\u00ad\u00e3\u0083\u009b\u00e3\u0082\u0099\u00e3\u0083\u0083\u00e3\u0083\u0088  #GodJesusRobot #robot #toyholiday  #holiday #vacation #\u00e6\u0097\u0085\u00e8\u00a1\u008c #photography  #\u00e5\u0086\u0099\u00e7\u009c\u009f #japan_of_insta #japantravel #\u00e6\u0097\u00a5\u00e6\u009c\u00ac\u00e6\u0097\u0085\u00e8\u00a1\u008c #travel #kitakyushu #\u00e5\u008c\u0097\u00e4\u00b9\u009d\u00e5\u00b7\u009e #airport #\u00e7\u00a9\u00ba\u00e6\u00b8\u00af #fukuokaairport #\u00e7\u00a6\u008f\u00e5\u00b2\u00a1\u00e7\u00a9\u00ba\u00e6\u00b8\u00af #plane #airplane #aeroplane #\u00e9\u00a3\u009b\u00e8\u00a1\u008c\u00e6\u00a9\u009f #windowseat #window";
$pattern = '/(\\\\u00[0-9a-f]{2})+/i'; // \\\\ in the PHP literal becomes \\ in the regex, i.e. one literal backslash

function getHex2Bin($matches) {
    return hex2bin(str_replace("\U00","",strtoupper($matches[0])));
}

$result = preg_replace_callback($pattern, 'getHex2Bin', $str);
echo $result;

This does work, giving me my desired result:

Time to head back to Tokyo. Fukuoka Airport, Japan. 18 October 2020 . . . . . #japan #日本 #toyphotography #toy #おもちゃ #ロボット #GodJesusRobot #robot #toyholiday #holiday #vacation #旅行 #photography #写真 #japan_of_insta #japantravel #日本旅行 #travel #kitakyushu #北九州 #airport #空港 #fukuokaairport #福岡空港 #plane #airplane #aeroplane #飛行機 #windowseat #window

However, I can't help feeling that there is a much neater solution.

Update 2

Here is a PHP Sandbox showing the results of all attempts mentioned above, including the messy working one.

JosefZ · Accepted Answer

You face a mojibake case (example in Python for its universal intelligibility):

print('\u00e6\u0097\u00a5\u00e6\u009c\u00ac'.encode('latin1').decode())

日本
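For readers who prefer PHP, here is a sketch of how such a string could come about. It is only an illustration of the mojibake mechanism, assuming the exporter misread UTF-8 bytes as ISO-8859-1 before writing JSON; how the export was actually produced is not known. It requires the mbstring extension.

```php
<?php
// Correct UTF-8 bytes for 日本: E6 97 A5 E6 9C AC.
$original = "\xE6\x97\xA5\xE6\x9C\xAC";

// Misreading each byte as an ISO-8859-1 character and re-encoding as UTF-8
// turns the 6 bytes into 6 separate code points (U+00E6, U+0097, ...).
$mojibake = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// json_encode then escapes every non-ASCII code point as \u00XX.
$json = json_encode($mojibake);

echo $json, PHP_EOL; // "\u00e6\u0097\u00a5\u00e6\u009c\u00ac"
```

Undoing the damage means running the same conversion in the opposite direction.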

Let's rewrite the above code in PHP terms (using the json_decode function):

<?php
$str_chin = '日本';
echo $str_chin . " => test: chinese to JSON => "
               . json_encode($str_chin) . PHP_EOL;

$str = '\u00e6\u0097\u00a5\u00e6\u009c\u00ac';
// Flags are the fourth argument of json_decode; the second one is $associative.
$str_moj = json_decode('"' . $str . '"', false, 512, JSON_INVALID_UTF8_IGNORE);

echo $str . " => mojibake => "
          . $str_moj . PHP_EOL;

echo $str . " => solution => "
          . mb_convert_encoding($str_moj, 'iso-8859-1', 'utf-8');

?>

Output:

73099438.php

日本 => test: chinese to JSON => "\u65e5\u672c"
\u00e6\u0097\u00a5\u00e6\u009c\u00ac => mojibake => æ¥æ¬
\u00e6\u0097\u00a5\u00e6\u009c\u00ac => solution => 日本
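The same conversion can be applied to a whole decoded export instead of a single string. The following is a minimal sketch, not a tested tool for Instagram data: fixMojibake is a hypothetical helper name, the JSON is a shortened stand-in for the real file (which might be read with file_get_contents), and it assumes every string in the export is mojibake of this kind. A string that already contains correct characters outside Latin-1 would be damaged by the conversion.

```php
<?php
// Hypothetical helper: undo the Latin-1 misreading for every string value in decoded JSON.
function fixMojibake(array $data): array
{
    array_walk_recursive($data, function (&$value) {
        if (is_string($value)) {
            // Map each code point U+0000..U+00FF back to the single byte it came from.
            $value = mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');
        }
    });
    return $data;
}

// Shortened stand-in for the export; single quotes keep \n and \u00e6 as JSON escapes.
$json = '{"media":[{"creation_timestamp":1619277565,"title":"Tokyo.\n#japan #\u00e6\u0097\u00a5\u00e6\u009c\u00ac"}]}';

// Decode to nested arrays, then repair all string leaves; numbers are left untouched.
$data = fixMojibake(json_decode($json, true));

echo $data['media'][0]['title'], PHP_EOL;
```

This lets json_decode handle the \uXXXX escapes, so no regular expression or hex2bin call is needed.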
