DemiImp

Reputation: 1018

Perl UTF8 Concatenation Problems

I am having trouble concatenating a UTF-8 string with another string that has been URI-escaped and then unescaped.

#!/usr/bin/perl
use strict;
use utf8;
use URI::Escape;

# binmode(STDOUT, ":utf8");

my $v = "ضثصثضصثشس";
my $v2 = uri_unescape(uri_escape_utf8($v));

print "Works: $v, ", "$v2\n";
print "Fails: $v, $v2\n";
print "Works: " . "$v2\n";

Here's the output:

Works: ضثصثضصثشس, ضثصثضصثشس
Wide character in print at ./testUTF8.pl line 14.
Fails: ضثصثضصثشس, ضثصثضصثشس
Works: ضثصثضصثشس

If I enable binmode(STDOUT, ":utf8"), as Perl's documentation suggests, the warning disappears but all three lines fail:

Fails: ضثصثضصثشس, ضثصثضصثشس
Fails: ضثصثضصثشس, ضثصثضصثشس
Fails: ضثصثضصثشس

What's going on? How can I fix this?

P.S. I need the string URL-escaped. Is there any way to escape/unescape in Perl the way JavaScript does? For example, Perl gives me: %D8%B6%D8%AB%D8%B5%D8%AB%D8%B6%D8%B5%D8%AB%D8%B4%D8%B3

This unescapes to: ضثصثضصثشس

When I escape the same text with JavaScript, I get: %u0636%u062B%u0635%u062B%u0636%u0635%u062B%u0634%u0633

Upvotes: 5

Views: 515

Answers (2)

ikegami

Reputation: 385789

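uri_unescape is the inverse of uri_escape. It returns raw bytes and doesn't presume that those bytes represent UTF-8 encoded text.

An inverse for uri_escape_utf8 isn't provided (maybe so that you can handle decoding errors yourself), so you have to decode the unescaped bytes explicitly: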

#!/usr/bin/perl
use strict;
use utf8;                     # Source code is UTF-8 encoded.
use open ':std', ':utf8';     # Terminal expects UTF-8.
use URI::Escape;

my $ov = "ضثصثضصثشس";

my $uri_comp = uri_escape_utf8($ov);

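my $nv = uri_unescape($uri_comp);   # Raw UTF-8 bytes, not yet characters.
utf8::decode($nv)                   # Decode the bytes in place.
    or die "Unescaped data is not valid UTF-8\n";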

print "$ov -> $uri_comp -> $nv\n";

ضثصثضصثشس -> %D8%B6%D8%AB%D8%B5%D8%AB%D8%B6%D8%B5%D8%AB%D8%B4%D8%B3 -> ضثصثضصثشس
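
If you need this round trip in several places, the two steps can be wrapped in a small helper. This is only a minimal sketch: uri_unescape_utf8 is a hypothetical name, not a function provided by URI::Escape, and it assumes you want malformed input to die (via Encode's FB_CROAK) rather than be accepted silently.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                   # Source code is UTF-8 encoded.
use Encode qw(decode FB_CROAK);
use URI::Escape qw(uri_escape_utf8 uri_unescape);

# Hypothetical helper (not part of URI::Escape): inverse of uri_escape_utf8.
sub uri_unescape_utf8 {
    my ($escaped) = @_;
    my $bytes = uri_unescape($escaped);         # %XX sequences -> raw bytes
    return decode('UTF-8', $bytes, FB_CROAK);   # bytes -> characters, dies if malformed
}

my $ov = "ضثصثضصثشس";
my $nv = uri_unescape_utf8(uri_escape_utf8($ov));

binmode(STDOUT, ':encoding(UTF-8)');
print $nv eq $ov ? "round trip ok\n" : "round trip failed\n";
```

Wrap the call in eval if you would rather report bad input than die.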

Upvotes: 3

amon

Reputation: 57600

From the documentation of URI::Escape:

uri_unescape($string,...)
Returns a string with each %XX sequence replaced with the actual byte (octet).

It does not interpret the resulting bytes as UTF-8 and will not decode them. You have to do that yourself:

use Encode qw/decode_utf8/;

# untested
my $v2 = decode_utf8 uri_unescape uri_escape_utf8 $v;
...
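
Why the concatenation in the question misbehaves can be shown with core Perl alone. In this sketch, $bytes is an assumption standing in for what uri_unescape returns (the UTF-8 encoded bytes of the string). When a byte string is joined with a character string, Perl treats each byte as a Latin-1 character, so the 18 bytes become 18 separate characters instead of 9.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                        # Source code is UTF-8 encoded.

my $chars = "ضثصثضصثشس";         # Character string: 9 characters.
my $bytes = $chars;
utf8::encode($bytes);            # Byte string: 18 UTF-8 bytes, like uri_unescape's result.

# Mixing the two: each byte is treated as one Latin-1 character.
my $mixed = $chars . $bytes;

print length($chars), "\n";      # 9
print length($bytes), "\n";      # 18
print length($mixed), "\n";      # 27, not 18

# Decoding first makes both operands character strings.
utf8::decode($bytes) or die "not valid UTF-8\n";
my $fixed = $chars . $bytes;
print length($fixed), "\n";      # 18
```

Once both sides are character strings, a single output layer such as binmode(STDOUT, ":utf8") encodes the whole line consistently.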

Upvotes: 5
