Duke Leto

Reputation: 215

Perl: how to convert Unicode characters to something that will survive a trip into and out of a database and render in HTML

TL;DR I have input that looks like this:

इस परीक्षण के लिए है
Something
Zürich

This data is then piped through a few programs and ultimately inserted into a MongoDB database. But by the time I query it back out and try to display it on a web page, it's all garbage.

I've found a lot of questions on how to encode these things but all the answers assume you want everything encoded and do not discuss how to decode it for display.

I only want the "weird" (non-ASCII) stuff encoded, so for the above I'd like to get output something like this:

0x1234;0x8737;0x838784; ...
Something
Z0x8387;rich

which would store fine in a database, and would survive a vim edit or whatever else, but then when I pull it out I want it to render correctly.

So how do I do that: encode in Perl and decode in JavaScript?

PS: I don't know what that string of symbols means, just found it somewhere. Sorry if it's offensive or something. Thanks!

Edit: choroba's answer is a very good start. Here's an example of what the algorithm produces:

input: 株式会社イノ設計
output: 0x230;0x160;0x170;0x229;0x188;0x143;0x228;0x188;0x154;0x231;0x164;0x190;0x227;0x130;0x164;0x227;0x131;0x142;0x232;0x168;0x173;0x232;0x168;0x136;

Now how do I render that in JavaScript? 0xNN was just an example of what I imagined the answer would be, but if there's a better way, by all means!

Thanks!

Upvotes: 0

Views: 82

Answers (1)

choroba

Reputation: 241918

Here's an example that produces something similar to what you want:

#! /usr/bin/perl
use warnings;
use strict;

sub escape {
    my ($in) = @_;
    # Escape every non-ASCII character, including those above U+FFFF.
    $in =~ s/([^\x{00}-\x{7f}])/sprintf '0x%d;', ord $1/ger;
}

my $in = "Z\N{LATIN SMALL LETTER U WITH DIAERESIS}rich";
my $out = 'Z0x252;rich';

$out eq escape($in) or die escape($in) . "\n$out\n";

You seem to want decimal digits after 0x. That's confusing, as 0x usually means hexadecimal. To get hexadecimal codes, change the sprintf template to '0x%x;'.

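Also note that this escaping is ambiguous: once someone enters the literal text 0x123; into your data directly, it can no longer be told apart from an escaped character when decoding, so the data will become corrupted.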
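
For the decoding side, here is a minimal JavaScript sketch, assuming the decimal 0xNNN; format produced by the escape sub above. The name decodeEscapes is a hypothetical helper, not part of any library. The second call shows the ambiguity: text that was typed literally gets decoded as if it were an escape.

```javascript
// Hypothetical helper: reverses the decimal "0xNNN;" escapes from the Perl sub above.
function decodeEscapes(text) {
  // The digits after "0x" are decimal here, matching the '0x%d;' template.
  return text.replace(/0x(\d+);/g, (match, digits) => {
    const codePoint = Number.parseInt(digits, 10);
    // Leave values outside the Unicode range untouched instead of throwing.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

console.log(decodeEscapes("Z0x252;rich"));             // Zürich
console.log(decodeEscapes("typed literally: 0x123;")); // typed literally: {
```

If you switch the Perl template to hexadecimal, the regular expression and the radix passed to parseInt would need to change to match.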

If you use &# instead of 0x at the beginning of each replaced character, the browser will render the characters correctly: Z&#252;rich renders as "Zürich".
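
When the stored string is written into the page as HTML, the browser decodes references such as &#252; on its own, so no JavaScript is required. If you need the decoded text outside of markup (for example, before assigning it to textContent), something like the following sketch could work. decodeNumericReferences is a hypothetical helper name, and it only handles numeric references, not named ones like &amp;.

```javascript
// Hypothetical helper: decodes decimal (&#252;) and hexadecimal (&#xfc;) references.
function decodeNumericReferences(text) {
  return text.replace(/&#(x[0-9a-f]+|\d+);/gi, (match, body) => {
    const isHex = body[0] === "x" || body[0] === "X";
    const codePoint = isHex
      ? Number.parseInt(body.slice(1), 16)
      : Number.parseInt(body, 10);
    // Leave values outside the Unicode range untouched instead of throwing.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

console.log(decodeNumericReferences("Z&#252;rich")); // Zürich
console.log(decodeNumericReferences("Z&#xfc;rich")); // Zürich
```

Doing the decoding yourself and assigning the result as plain text avoids inserting stored data into the page as markup, which matters if the data can contain user input.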

Upvotes: 2
