Reputation: 933
I want to extract content from a web page that contains many characters encoded in the form "%xx". Since I use the Perl module LWP to fetch the page, I naturally tried to decode these characters with a Perl regex, as below.
my $html = "%20%26%40 ";
$html =~ s#%([0-9a-f]+)#\x{\1}#ig;
print "$html\n";
But the above code doesn't work; it outputs nothing but "00". I'm stuck now. Any hint would be appreciated.
Thanks, Ye
Upvotes: 1
Views: 806
Reputation: 385744
First, that has nothing to do with HTML. That escaping mechanism (percent-encoding) is used by URIs.
It seems really odd that you would have to do that. The only code that usually needs to undo that encoding is a CGI script receiving parameters, in which case all you need is
use CGI;
my $cgi = CGI->new();
my $foo = $cgi->param('foo');
But let's say you need to do your own URI parsing. You could use:
use URI;
my %form = URI->new($url)->query_form();
my $foo = $form{'foo'};
Upvotes: 0
Reputation: 11992
Funny and ugly code:
my $html = "%20%26%40 ";
$html =~ s#%([0-9a-f]{2})#"chr(0x$1)"#igee;
print "$html\n";
Edit: (I'm obliged to say) this code may be cute, but do not use it in production! There are many cases where it doesn't work.
Upvotes: -1
Reputation: 168685
The URI::Escape module already provides functions for this. You don't need to mess with regular expressions.
use URI::Escape;
my $decoded = uri_unescape($string);
See this page for more
Upvotes: 7
Reputation: 126722
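You need an executable substitution (the /e modifier), so that the replacement is evaluated as Perl code for each match:
$html =~ s/%([0-9a-f]{2})/chr hex $1/ieg;
but it is better to use the URI::Escape module, which ships with the URI distribution that Gisle Aas' excellent LWP suite depends on.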
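For what it's worth, here is a minimal core-Perl sketch of the executable-substitution approach applied to the string from the question. The helper name percent_decode is made up for illustration and is not part of any module. The original attempt fails because an escape like \x{...} is resolved when the replacement string is parsed, before any capture exists, whereas with /e the replacement runs as code per match. The second half assumes the page's non-ASCII characters are percent-encoded UTF-8 bytes, which may not hold for every site.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn each %xx escape into the byte it stands for.
sub percent_decode {
    my ($text) = @_;
    # /e evaluates chr(hex($1)) for every match; {2} limits each
    # escape to exactly two hex digits so following text is untouched.
    (my $bytes = $text) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $bytes;
}

my $html = "%20%26%40 ";
print percent_decode($html), "\n";    # prints " &@ " (space, &, @, space)

# Assumption: multi-byte characters were encoded as UTF-8 bytes,
# so the decoded bytes still need to be turned into characters.
my $chars = decode('UTF-8', percent_decode("caf%C3%A9"));
print length($chars), "\n";           # prints 4
```

In real code uri_unescape from URI::Escape does the first step for you; the Encode step is still needed if you want characters rather than bytes.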
Upvotes: 2