MattBelanger

Reputation: 5350

Converting HTML Entities in UTF-8 to SHIFT_JIS

I am working with a website that needs to target old Japanese mobile phones that are not Unicode-enabled. The problem is that the text for the site is saved in the database as HTML entities (e.g., &#1234;). This database absolutely cannot be changed, as it is used by several hundred websites.

What I need to do is convert these entities to actual characters, and then convert the string encoding before sending it out, as the phones render the entities without converting them first.

I've tried both mb_convert_encoding and iconv, but they only convert the encoding of the entity strings themselves; they do not turn the entities into actual characters.

Thanks in advance

EDIT:

I have also tried html_entity_decode. It is producing the same results - an unconverted string.

Here is the sample data I am working with.

The desired result: シェラトン・ヌーサリゾート&スパ

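The HTML Codes: &#12471;&#12455;&#12521;&#12488;&#12531;&#12539;&#12492;&#12540;&#12469;&#12522;&#12478;&#12540;&#12488;&&#12473;&#12497;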

The output of html_entity_decode([the string above],ENT_COMPAT,'SHIFT_JIS'); is identical to the input string.

Upvotes: 2

Views: 5396

Answers (4)

Mantisse

Reputation: 309

I ran into a similar encoding bug while coding, so I would suggest this snippet:

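 $string_to_encode = " your string ";
 // mb_detect_encoding() returns false when no encoding can be detected
 $detected = mb_detect_encoding($string_to_encode);
 if ($detected !== false) {
      // convert from the detected encoding to UTF-8
      $converted_string = mb_convert_encoding($string_to_encode, 'UTF-8', $detected);
 }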

It may not be the best approach for a large amount of data, but it works.

Upvotes: 0

jeroen

Reputation: 91762

I think you just need html_entity_decode.

Edit: Based on your edit:

// create_function() was removed in PHP 8, so an anonymous function is used instead
$output = preg_replace_callback('/&#[0-9]+;/', function ($m) { return mb_convert_encoding($m[0], 'UTF-8', 'HTML-ENTITIES'); }, $original_string);

Note that this is just your first step, to convert your entities to the actual characters.

Upvotes: 0

hakre

Reputation: 198117

Just make sure you are creating the right code points from the entities. For example, if the original encoding is UTF-8:

$originalEncoding = 'UTF-8'; // that's only assumed, you have not shared the info so far
$targetEncoding = 'SHIFT_JIS';
$string = '... whatever you have ... ';
// superfluous, but to get the picture:
$string = mb_convert_encoding($string, 'UTF-8', $originalEncoding);
$string = html_entity_decode($string, ENT_COMPAT, 'UTF-8');
$stringTarget = mb_convert_encoding($string, $targetEncoding, 'UTF-8');
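
To tie this back to the sample data in the question, here is a minimal sketch of the same two steps wrapped in a function. The name entitiesToShiftJis is hypothetical, and the sketch assumes the database text is UTF-8 (or plain ASCII) containing decimal entities; adjust the source encoding if yours differs.

```php
<?php
// Hypothetical helper, not part of any library: decode entities, then re-encode.
function entitiesToShiftJis(string $text): string
{
    // Step 1: turn entities such as &#12471; into real UTF-8 characters.
    $utf8 = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Step 2: convert the UTF-8 string to Shift_JIS for the handset.
    return mb_convert_encoding($utf8, 'SJIS', 'UTF-8');
}

// First three entities of the sample string from the question.
$sjis = entitiesToShiftJis('&#12471;&#12455;&#12521;');

// Round-trip back to UTF-8 so the result can be checked on a UTF-8 terminal.
echo mb_convert_encoding($sjis, 'UTF-8', 'SJIS'), "\n";
```

The important part is the order: decode with 'UTF-8' as the charset first, and only convert to Shift_JIS afterwards. Remember to send a matching charset in the Content-Type header as well.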

Upvotes: 2

ianbarker

Reputation: 1254

I found this function on php.net; it works for me with your example:

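function unhtmlentities($string) {
    // The /e modifier was removed in PHP 7, and chr() only covers 0-255,
    // so use callbacks with mb_chr() (PHP 7.2+) to emit UTF-8 characters.
    // replace hexadecimal numeric entities
    $string = preg_replace_callback('~&#x([0-9a-f]+);~i', function ($m) {
        return mb_chr(hexdec($m[1]), 'UTF-8');
    }, $string);
    // replace decimal numeric entities
    $string = preg_replace_callback('~&#([0-9]+);~', function ($m) {
        return mb_chr((int) $m[1], 'UTF-8');
    }, $string);
    // replace named entities
    $trans_tbl = get_html_translation_table(HTML_ENTITIES);
    $trans_tbl = array_flip($trans_tbl);
    return strtr($string, $trans_tbl);
}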

Upvotes: 1
