Reputation: 43673
In Perl, how can I convert string containing utf-8 characters to HTML where such characters will be converted into &...; ?
Upvotes: 1
Views: 2250
Reputation: 22421
Just replace every symbol that is not printable and not low ASCII (that is, anything outside \x20 - \x7F region) with simple calculation of its ord
+ necessary HTML entity formatting. Perl regexp have /e
flag to indicate that replacement should be treated as code.
use utf8;
my $str = "testТест"; # This is correct UTF-8 string right in the code
$str =~ s/([^[\x20-\x7F])/"&#" . ord($1) . ";"/eg;
print $str;
# testТест
Upvotes: 2
Reputation: 241858
First, split on an empty pattern to get a list of single characters. Then, map each character to itself, if it is ASCII, or its code, if it is not:
use Encode qw( decode_utf8 );
my $utf8_string = "\xE2\x80\x9C\x68\x6F\x6D\x65\xE2\x80\x9D";
my $unicode_string = decode_utf8($utf8_string);
my $html = join q(),
map { ord > 127 ? "&#" . ord . ";"
: $_
} split //, $unicode_string;
Upvotes: 3