Reputation: 449
I am new to Perl and I want to write a simple script that fetches a webpage's content via LWP::Simple get() and then searches the get() result for a regex match. Here is my code:
$content = get("http://pl.wikipedia.org/wiki/$arg1");
my $result = grep(/en\.wikipedia\.org\/wiki\/[A-Za-z]+\"\s*title/, $content);
print $result;
When I print the result, it is "1". How can I get the matched string instead: 'en.wikipedia.org/wiki/TextIWantToGet" title'?
Thanks in advance!
Upvotes: 2
Views: 7709
Reputation: 185126
Here is what I would do, building on your code:
use strict; use warnings;
use LWP::UserAgent;
use HTTP::Request;
my $arg1 = "Rower";
# Create a user agent object
my $ua = LWP::UserAgent->new;
# Create a request
my $req = HTTP::Request->new(GET => "http://pl.wikipedia.org/wiki/$arg1");
# Pass request to the user agent and get a response back
my $res = $ua->request($req);
# Check the outcome of the response
die $res->status_line, "\n" unless $res->is_success;
my $content = $res->content;
# Only print the capture if the match succeeded; $1 is not reset on failure
if ($content =~ m{en\.wikipedia\.org/wiki/([A-Za-z]+)"\s*title}) {
print "$1\n";
}
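To try the capture-group approach without network access, here is a minimal sketch that runs the same pattern against a small HTML fragment. The fragment, the variable names, and the use of /g to collect every match are assumptions made for illustration; they are not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the downloaded page, so no network access is needed
my $content = <<'HTML';
<li><a href="//en.wikipedia.org/wiki/Bicycle" title="Bicycle">English</a></li>
<li><a href="//de.wikipedia.org/wiki/Fahrrad" title="Fahrrad">Deutsch</a></li>
HTML

# qr{...} avoids escaping every slash; the capture group holds the page name
my $pattern = qr{en\.wikipedia\.org/wiki/([A-Za-z]+)"\s*title};

# Only use $1 after a successful match; otherwise it may hold a stale value
if ($content =~ $pattern) {
    print "first match: $1\n";
}
else {
    print "no match\n";
}

# With /g in list context, the match returns every captured name at once
my @names = $content =~ /$pattern/g;
print "all matches: @names\n";
```

With this fragment only the English link matches, so both lines report Bicycle.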
However, parsing HTML with regular expressions is discouraged. A more robust approach is to learn HTML::TreeBuilder::XPath and query the page with XPath:
use strict; use warnings;
use HTML::TreeBuilder::XPath;
use LWP::UserAgent;
use HTTP::Request;
my $arg1 = "Rower";
# Create a user agent object
my $ua = LWP::UserAgent->new;
# Create a request
my $req = HTTP::Request->new(GET => "http://pl.wikipedia.org/wiki/$arg1");
# Pass request to the user agent and get a response back
my $res = $ua->request($req);
# Check the outcome of the response
die $res->status_line, "\n" unless $res->is_success;
my $tree = HTML::TreeBuilder::XPath->new_from_content( $res->content );
# Using XPath, searching for all links having a 'title' attribute
# and having a 'href' attribute matching 'en.wikipedia.org'
my $link = $tree->findvalue(
'//a[@title]/@href[contains(., "en.wikipedia.org")]'
);
$link =~ s!.*/!!;
print "$link\n";
Just for fun, here is a more concise version using WWW::Mechanize:
use strict; use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder::XPath;
my $m = WWW::Mechanize->new( autocheck => 1 );
$m->get("http://pl.wikipedia.org/wiki/$ARGV[0]");
my $tree = HTML::TreeBuilder::XPath->new_from_content( $m->content );
print join "\n", map { s!.*/!!; $_ } $tree->findvalues(
'//a[@title]/@href[contains(., "en.wikipedia.org")]'
);
Upvotes: 6
Reputation: 12002
You need to wrap $result in parentheses to force list context instead of scalar context. The Perl documentation for grep says:
"Evaluates the BLOCK or EXPR for each element of LIST (locally setting $_ to each element) and returns the list value consisting of those elements for which the expression evaluated to true. In scalar context, returns the number of times the expression was true."
So you need to use something like:
my ($result) = grep(/en\.wikipedia\.org\/wiki\/([A-Za-z]+)\"\s*title/, $content);
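Note that grep returns the matching list elements, not the captured text, so with a single string in $content the code above puts the entire page into $result. To get only the part captured after /wiki/ (upper- or lowercase A-Z), use the match operator in list context instead, e.g. my ($result) = $content =~ /.../; with the same pattern.
It also depends on which part of the HTML you are actually interested in: the end of the URL, or the title of the page?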
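To see the difference between the contexts concretely, here is a minimal sketch comparing grep in scalar context, grep in list context, and the match operator in list context. The sample string and the variable names are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical one-line stand-in for the page content
my $content = '<a href="//en.wikipedia.org/wiki/Bicycle" title="Bicycle">English</a>';
my $re      = qr{en\.wikipedia\.org/wiki/([A-Za-z]+)"\s*title};

# Scalar context: grep returns how many elements matched
my $count = grep { /$re/ } $content;

# List context: grep returns the matching elements themselves
# (here that is the whole string, not the captured part)
my ($element) = grep { /$re/ } $content;

# Match operator in list context: returns the capture groups
my ($name) = $content =~ $re;

print "count:   $count\n";
print "element: $element\n";
print "name:    $name\n";
```

Here $count is 1 (which explains the "1" in the question), $element is the full input string, and $name is Bicycle.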
Upvotes: 2