thinkhy

Reputation: 933

How can I convert "%xx" characters in HTML using Perl?

I want to extract content from a web page that contains many Unicode characters represented in the form "%xx". Since I used the Perl module LWP to fetch the page, I naturally tried to handle these characters with a Perl regex, as below.

my $html = "%20%26%40 ";
$html =~ s#%([0-9a-f]+)#\x{\1}#ig;
print "$html\n";

But the above code doesn't work; it outputs nothing but "00". I'm stuck now... Any hint would be appreciated.

Thanks, Ye

Upvotes: 1

Views: 806

Answers (4)

ikegami

Reputation: 385744

First, that has nothing to do with HTML. That escaping mechanism (percent-encoding) is used by URIs.

It seems really odd that you would have to do that. The only thing that usually needs to undo that encoding is CGI scripts receiving parameters, in which case all you need is

use CGI;
my $cgi = CGI->new();
my $foo = $cgi->param('foo');

But let's say you need to do your own URI parsing. You could use:

use URI;
my %form = URI->new($url)->query_form();
my $foo = $form{'foo'};

CGI, URI

Upvotes: 0

Orabîg

Reputation: 11992

Funny and ugly code :

my $html = "%20%26%40 ";
$html =~ s#%([0-9a-f]{2})#"chr(0x$1)"#igee;
print "$html\n";

Edit: (I'm obliged to say) this code may be cute, but do not use it in production! There are many cases where it doesn't work.

Upvotes: -1

Spudley

Reputation: 168685

The URI::Escape module already provides functions for this. You don't need to mess with regular expressions.

use URI::Escape;
my $decoded = uri_unescape($string);

See the URI::Escape documentation for more.
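
As a slightly fuller sketch of the same idea: uri_unescape returns raw bytes, so if the page's escapes represent UTF-8 encoded text (an assumption about your data, though it is the usual convention), you would decode the result afterwards. The sample string is the one from the question; the variable names are only illustrative.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_unescape);
use Encode qw(decode);

my $escaped = '%20%26%40';

# uri_unescape turns each %XX escape into the corresponding byte.
my $bytes = uri_unescape($escaped);

# Assumption: the bytes are UTF-8 encoded text, so decode them to characters.
my $text = decode('UTF-8', $bytes);

print "[$text]\n";    # prints "[ &@]"
```

If the page uses a different encoding, the name passed to decode would need to change accordingly.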

Upvotes: 7

Borodin

Reputation: 126722

You need an executable substitution

$html =~ s/%([0-9a-f]{2})/chr hex $1/ieg;

but it is better to use the URI::Escape module, which is part of Gisle Aas' excellent LWP suite.
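
If you can't install the module, here is a minimal core-only sketch of the substitution approach. It assumes the escapes encode UTF-8 bytes, which is an assumption about your page, and the helper name percent_decode is made up for illustration; it is not part of any module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: undo %XX escapes, then treat the bytes as UTF-8.
sub percent_decode {
    my ($escaped) = @_;

    # Replace each "%" followed by exactly two hex digits with that byte.
    # The /e modifier evaluates "chr hex $1" as Perl code for every match.
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;

    # Assumption: the resulting bytes are UTF-8 encoded text.
    return decode('UTF-8', $bytes);
}

my $text = percent_decode('%20%26%40');
print "[$text]\n";    # prints "[ &@]"
```

Matching exactly two hex digits matters: with an open-ended quantifier, input such as "%20abc" would swallow the "abc" as part of the escape.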

Upvotes: 2
