Gaurav Pant

Reputation: 4199

Convert multiple Unicode in a string to character

Problem: I have a string, say 'Buna$002C_TexasBuna$002C_Texas', where each $ is followed by a four-digit hexadecimal Unicode code point. I want to replace each of these sequences with the character it represents.

In Perl, a code point written in the form "\x{002C}" inside a double-quoted string literal is converted to the corresponding character. Sample code is below.

#!/usr/bin/perl
my $string = "Hello \x{263A}!\n";
@arr= split //,$string;
print "@arr";

I am processing a file that contains 10 million records, so I have these strings in a scalar variable. To do the same as above, I am converting each $4_digit_unicode to \x{4_digit_unicode} as below.

$str = 'Buna$002C_TexasBuna$002C_Texas';
$str =~s/\$(.{4})/\\x\{$1\}/g;
$str = "$str";

It gives me

Buna\x{002C}_TexasBuna\x{002C}_Texas

This is because in $str = "$str", Perl interpolates the variable $str but does not process escape sequences inside its value, so \x{002C} stays as literal text.

Is there a way to force Perl to interpolate the contents of $str as well?

OR

Is there another method to achieve this? I do not want to extract each code point, pack it using pack "U4",0x002C, and then substitute it back. Something in one line (like the unsuccessful attempt below) would be fine.

$str =~ s/\$(.{4})/pack("U4",$1)/g;

I know the above is wrong, but can I do something like it?

For the input string $str = 'Buna$002C_TexasBuna$002C_Texas', the desired output is Buna,_TexasBuna,_Texas.

Upvotes: 2

Views: 399

Answers (3)

ikegami

Reputation: 385754

"\x{263A}" (quotes included) is a string literal, a piece of code that produces a string containing the lone character 263A when it's evaluated by the interpreter (by being part of the script passed to perl to be evaluated).

"\\x\{$1\}" (quotes included), on the other hand, produces a string consisting of \, x, {, the contents of $1, and }.

The latter is the string you are producing. You appear to be attempting to produce Perl code, but it's not valid Perl code -- it's missing the quotes -- and you never have the code interpreted by perl.
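The difference between the two strings can be checked directly. The following is a minimal sketch, not code from the question; the variable names $literal, $digits and $built are made up for illustration, and $digits stands in for the capture $1.

```perl
use strict;
use warnings;

# The escape is processed when perl compiles this literal,
# so the result is a single character (U+263A).
my $literal = "\x{263A}";

# Here the backslash is itself escaped, so the result is plain text
# assembled at run time: \, x, {, the four digits, and }.
my $digits = '263A';
my $built  = "\\x\{$digits\}";

printf "literal: %d character(s)\n", length $literal;
printf "built:   %d character(s): %s\n", length $built, $built;
```

Assigning $built to another double-quoted string does not change it, because perl only processes escapes in source code it compiles, not in the values of variables.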


 $str =~ s/\$(.{4})/\\x\{$1\}/g;

is short for

 $str =~ s/\$(.{4})/ "\\x\{$1\}" /eg;

which is completely different from

 $str =~ s/\$(.{4})/ "\x{263A}" /eg;

It looks like you were going for the following:

$str =~ s/\$(.{4})/ eval qq{"\\x\{$1\}"} /eg;

But there are much simpler ways of producing the desired string, such as

$str =~ s/\$(.{4})/ pack "U", hex $1 /eg;

or better yet,

$str =~ s/\$(.{4})/ chr hex $1 /eg;
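For a file with millions of records, it may be convenient to wrap the chr hex substitution in a small subroutine and call it per line. This is only a sketch: the name decode_dollar_hex is hypothetical, and restricting the match to hex digits is an assumption about the data rather than something stated in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the question): replaces every "$" followed
# by exactly four hex digits with the character at that code point.
sub decode_dollar_hex {
    my ($text) = @_;
    $text =~ s/\$([0-9A-Fa-f]{4})/ chr hex $1 /eg;
    return $text;
}

# Encode output as UTF-8 so that code points above 0xFF can be printed
# without a "Wide character" warning.
binmode STDOUT, ':encoding(UTF-8)';

print decode_dollar_hex('Buna$002C_TexasBuna$002C_Texas'), "\n";
```

In a record-processing loop, the same call would be applied to each line as it is read, which avoids holding the whole file in memory.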

Upvotes: 1

AdrianHHH

Reputation: 14038

This gives the desired result:

use strict;
use warnings;
use feature 'say';

my $str = 'Buna$002C_TexasBuna$002C_Texas';

$str =~s/\$(.{4})/chr(hex($1))/eg;

say $str;

The main interesting item is the e in s///eg. The e means to treat the replacement text as code to be executed. The hex() converts a string of hexadecimal characters to a number. The chr() converts a number to a character. The replace line might be better written as below to avoid trying to convert a dollar followed by non-hexadecimal characters.

$str =~s/\$([0-9a-f]{4})/chr(hex($1))/egi;
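The effect of the stricter pattern can be seen on input that contains a dollar sign followed by non-hexadecimal text. The sketch below uses a made-up input string, and the variable names $input, $loose and $strict are illustrative only.

```perl
use strict;
use warnings;

# Made-up input: "$USD_" is not a code point, "$002C" is.
my $input = 'fee$USD_Buna$002C_Texas';

my $loose = $input;
{
    # hex() warns about illegal digits; silence that for this demonstration.
    no warnings 'digit';
    # .{4} accepts any four characters, so "$USD_" is also replaced:
    # hex('USD_') is 0, and chr(0) is a NUL character.
    $loose =~ s/\$(.{4})/chr(hex($1))/eg;
}

my $strict = $input;
# Only "$" followed by four hex digits is replaced; "$USD_" is left alone.
$strict =~ s/\$([0-9a-f]{4})/chr(hex($1))/egi;

print "strict: $strict\n";
printf "loose result contains NUL: %s\n", ($loose =~ /\0/ ? 'yes' : 'no');
```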

Upvotes: 7

Ibrahim Najjar

Reputation: 19423

You can execute expressions such as pack in the replacement part of a substitution; you just have to use the e modifier.

Or you can interpolate the result of the expression, without the e modifier, like this:

$str =~ s/\$(.{4})/@{[ pack("U", hex($1)) ]}/g;

If neither option works, please let me know. This Stack Overflow question has more information.

Upvotes: 1
