Jay

Reputation: 1104

How do I encode Japanese into something like "&#26085;&#26412;&#12395;&#34892;&#12387;&#12390;"? (UTF-8)

As the title says. I haven't been able to solve this with any of the following: PHP headers, CSS headers, HTML headers, MySQL charsets (collation set to utf8_general_ci), or

<form accept-charset="utf-8"... >

Really stumped on this one.

I'm basically going through this process:

  1. Type Japanese characters, process through a form
  2. Form saves in MySQL DB
  3. PHP pulls data out of MySQL DB, and formats it for a webpage

At step 3, I check the page source and see the literal Japanese characters. I'm guessing that is what causes the PHP errors I'm getting: functions that work fine on English text don't work properly on the Japanese text.

So I want to encode the text as UTF-8, but I'm not sure how to do that.

Edit: Here's the PHP function I'm using on the Japanese text

function short_text_jap($text, $length=300) {
    if (strlen($text) > $length) {
        $pattern = '/^(.{0,'.$length.'}\\b).*$/s';
        $text = preg_replace($pattern, "$1...", $text);
    }
    return $text;
}

But instead of a shortened amount of text, it returns the whole thing.

Upvotes: 3

Views: 4625

Answers (2)

Saul

Reputation: 18041

There seems to be some confusion about what UTF-8 is, since the goal is stated as getting the "UTF-8 version" of literal Japanese characters.

Things like &#26085; are numeric character references written in ASCII-compatible characters: they name a Unicode code point and are themselves stored in some encoding. UTF-8, by contrast, is a multibyte encoding scheme that defines how characters are stored at the byte level.
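
To make the distinction concrete, here is a minimal sketch that shows one character both ways. It assumes the file is saved as UTF-8 and that the mbstring extension is available (mb_ord needs PHP 7.2 or later); the variable names are only for illustration.

```php
<?php
// Sketch: the same character as raw UTF-8 bytes and as a numeric character reference.
// Assumption: this source file is saved as UTF-8 and mbstring is loaded.
$char = '日';

// UTF-8 stores this character as three bytes.
$utf8Bytes = bin2hex($char);   // hex dump of the stored bytes
$byteCount = strlen($char);    // strlen() counts bytes, not characters

// A numeric character reference spells out the Unicode code point in ASCII.
$codePoint = mb_ord($char, 'UTF-8');
$reference = '&#' . $codePoint . ';';

echo $utf8Bytes, "\n";  // e697a5
echo $byteCount, "\n";  // 3
echo $reference, "\n";  // &#26085;
```

The literal character in the page source is therefore already UTF-8; the reference is just a different, ASCII-only way of writing the same code point.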

I suggest relying on the literal form since it makes the whole mess with international alphabets easier to manage.

Simply migrate to UTF-8 everywhere: in the database, in HTML, in PHP and in the encoding of the source files. Then you can use the PHP Multibyte String extension, which is designed to handle multibyte characters:

mb_internal_encoding("UTF-8");

function short_text_jap($text, $length=300) {
    return mb_strlen($text) > $length ? mb_substr($text, 0, $length) : $text;
}

echo short_text_jap('日本語', 2); // outputs 日本
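
As for why the function in the question probably hands back the whole text: strlen() counts bytes, and without the u modifier the regex also works on bytes, where \b normally only finds boundaries next to ASCII word characters. On purely Japanese text the pattern may then fail to match, and preg_replace returns its input unchanged. Below is a rough sketch of a character-based variant that keeps the trailing "..." from the question. The name short_text_ellipsis and the $suffix parameter are made up for this example, and UTF-8 input is assumed.

```php
<?php
// Sketch only: assumes UTF-8 input and the mbstring extension.
// short_text_ellipsis() and $suffix are hypothetical, not part of the answer above.
function short_text_ellipsis(string $text, int $length = 300, string $suffix = '...'): string
{
    // Measure in characters, not bytes; short text is returned untouched.
    if (mb_strlen($text, 'UTF-8') <= $length) {
        return $text;
    }
    // Cut on a character boundary, then mark the truncation.
    return mb_substr($text, 0, $length, 'UTF-8') . $suffix;
}

$sample = '日本に行って';

echo strlen($sample), "\n";                   // 18 (bytes)
echo mb_strlen($sample, 'UTF-8'), "\n";       // 6 (characters)
echo short_text_ellipsis($sample, 2), "\n";   // 日本...
echo short_text_ellipsis($sample, 10), "\n";  // 日本に行って
```

Unlike the regex in the question, this cuts at a fixed character count rather than at a word boundary. Japanese text usually has no spaces between words, so a \b-based cut would behave quite differently from how it does on English text anyway.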

Upvotes: 1

Gumbo

Reputation: 655229

Since you seem to want to convert your UTF-8 encoded string to ASCII, with non-ASCII characters turned into character references, you can use PHP’s multibyte string functions to do so:

mb_substitute_character('entity');
$str = '日本語';  // UTF-8 encoded string
echo mb_convert_encoding($str, 'US-ASCII', 'UTF-8');

The output is:

&#x65E5;&#x672C;&#x8A9E;
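
If you later need the literal characters back, the conversion can be reversed. This is a small sketch using html_entity_decode; the variable names are illustrative and UTF-8 is assumed as the target encoding.

```php
<?php
// Sketch: turn hexadecimal character references back into UTF-8 text.
// Assumption: UTF-8 is the desired output encoding.
$encoded = '&#x65E5;&#x672C;&#x8A9E;';

// html_entity_decode() handles numeric references as well as named entities.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, "\n";          // 日本語
echo strlen($decoded), "\n";  // 9 (three characters, three bytes each)
```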

Upvotes: 5
