Ωmega

Reputation: 43673

Convert UTF-8 into HTML &...; entities

In Perl, how can I convert a string containing UTF-8 characters to HTML, so that such characters are converted into &...; entities?

Upvotes: 1

Views: 2250

Answers (2)

Oleg V. Volkov

Reputation: 22421

Just replace every character that is not printable low ASCII (that is, anything outside the \x20-\x7F range) with a numeric entity built from its ord value plus the necessary HTML formatting. Perl's substitution operator has the /e flag, which tells it to evaluate the replacement as code.

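use utf8;  # the source file is UTF-8, so the literal below is decoded to characters
my $str = "testТест";
# Replace each character outside \x20-\x7F with its decimal reference
$str =~ s/([^\x20-\x7F])/"&#" . ord($1) . ";"/eg;
print $str;
# test&#1058;&#1077;&#1089;&#1090;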
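
The substitution above can also be wrapped in a small reusable helper. This is only a minimal sketch, assuming the input is already a decoded character string; the name `to_numeric_entities` is hypothetical and not part of the answer. The test string uses \x{...} escapes so the example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): replace every character
# outside \x20-\x7F with a decimal numeric character reference.
sub to_numeric_entities {
    my ($text) = @_;
    $text =~ s/([^\x20-\x7F])/'&#' . ord($1) . ';'/eg;
    return $text;
}

# "test" followed by four Cyrillic letters (U+0422 U+0435 U+0441 U+0442)
print to_numeric_entities("test\x{422}\x{435}\x{441}\x{442}"), "\n";
# prints: test&#1058;&#1077;&#1089;&#1090;
```

Because the character class excludes everything below \x20, tabs and newlines are converted as well (a newline becomes &#10;). Widen the class if those should be left alone.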

Upvotes: 2

choroba

Reputation: 241858

First, split on an empty pattern to get a list of single characters. Then map each character to itself if it is ASCII, or to a numeric entity built from its code point if it is not:

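use Encode qw( decode_utf8 );

# UTF-8 bytes for "home" wrapped in curly double quotes (U+201C, U+201D)
my $utf8_string = "\xE2\x80\x9C\x68\x6F\x6D\x65\xE2\x80\x9D";
# Decode the bytes into characters so that ord returns code points
my $unicode_string = decode_utf8($utf8_string);

# Keep ASCII characters as they are; turn everything else into &#code;
my $html = join q(),
    map { ord($_) > 127 ? "&#" . ord($_) . ";"
                        : $_
        } split //, $unicode_string;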
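
As an alternative to the split/map above (a different technique, not the one used in this answer), the Encode module can produce the same kind of output through its FB_HTMLCREF fallback. A minimal sketch, assuming the same byte string as input; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw( decode_utf8 encode );

# UTF-8 bytes for "home" wrapped in curly double quotes (U+201C, U+201D)
my $utf8_bytes = "\xE2\x80\x9C\x68\x6F\x6D\x65\xE2\x80\x9D";
my $chars      = decode_utf8($utf8_bytes);

# Encoding to ASCII with the FB_HTMLCREF fallback replaces every character
# that ASCII cannot represent with a decimal reference of the form &#NNNN;
my $html = encode('ascii', $chars, Encode::FB_HTMLCREF);

print "$html\n";
# prints: &#8220;home&#8221;
```

Like the `ord > 127` test, this leaves all ASCII characters untouched, including control characters.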

Upvotes: 3
