Reputation: 27
I want to extract URL information from ask.com.
This is the tag:
<p class="PartialSearchResults-item-url">maps.google.com </p>
This is the code I tried, but it extracts junk along with the URLs:
$p = HTML::TokeParser->new(\$rrs);
while ($p->get_tag("p")) {
    my @link = $p->get_trimmed_text("/p");
    foreach (@link) { print "$_\n"; }
    open(OUT, ">>askurls.txt"); print OUT "@link\n"; close(OUT);
}
I only want domain URLs, like maps.google.com,
but it also extracts Source, Images, and the text of every other p element, filling askurls.txt with irrelevant information.
Added:
askurls.txt gets filled with this information:
Videos
Change Settings
OK
Sites Google
Sites Google.com Br
Google
Cookie Policy
assistant.google.com
Meet your Google Assistant. Ask it questions. Tell it to do things. It's your own personal Google, always ready to help whenever you need it.
www.google.com/drive
Safely store and share your photos, videos, files and more in the cloud. Your first 15 GB of storage are free with a Google account.
translate.google.com
Google's free service instantly translates words, phrases, and web pages between English and over 100 other languages.
duo.google.com
Upvotes: 1
Views: 94
Reputation: 365
You can use a simple regex to extract just the part you want:
use strict;
use warnings;
my $text = <<'HTML'; # we are creating example data using a heredoc
<p class="PartialSearchResults-item-url"> maps.google.com </p>
<p class="PartialSearchResults-item-url">example.com</p>
HTML
# the while loop with /g visits every match of the regex in turn
while ($text =~ m/class="PartialSearchResults-item-url">(.*?)<\/p>/g) {
    print $1."\n";
}
If you are not sure whether there is whitespace around the domain inside the tag (as in <p class="PartialSearchResults-item-url">maps.google.com </p>), you can use \s*, like:
m/class="PartialSearchResults-item-url">\s*(.*?)\s*<\/p>/g # \s* skips any whitespace before and after the URL, so it stays out of the capture
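Put together as a complete script, the whitespace-tolerant pattern might look like the sketch below. The sample markup is only example data modelled on the tag from the question (the third paragraph and its class name are made up to show a non-matching element), and @urls is just an illustrative variable name.

```perl
use strict;
use warnings;

# Example data only; the real page markup may differ.
my $text = <<'HTML';
<p class="PartialSearchResults-item-url"> maps.google.com </p>
<p class="PartialSearchResults-item-url">example.com</p>
<p class="PartialSearchResults-item-abstract">Some description text.</p>
HTML

# In list context, a /g match returns every captured group at once.
# The \s* on both sides keeps surrounding whitespace out of the capture.
my @urls = $text =~ m/class="PartialSearchResults-item-url">\s*(.*?)\s*<\/p>/g;

print "$_\n" for @urls;
```

Only the paragraphs whose class attribute ends the opening tag exactly as shown are matched, so the description paragraph is skipped.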
And if you want to check that the domain is valid, you can use is_domain() from the Data::Validate::Domain module:
# previous script
use Data::Validate::Domain qw(is_domain);
# \s* keeps surrounding whitespace out of $1 so it does not interfere with validation
while ($text =~ m/class="PartialSearchResults-item-url">\s*(.*?)\s*<\/p>/g) {
    if (is_domain($1)) {
        print $1."\n";
    }
}
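If you would rather keep HTML::TokeParser from the question, a possible alternative is to look at the attributes that get_tag returns and skip every p element whose class is not the one you want. This is only a sketch: the sample HTML is made up (including the "PartialSearchResults-item-abstract" class name), and it assumes the class attribute contains exactly that single class name.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up sample input; in the question this would be the fetched page in $rrs.
my $html = <<'HTML';
<p class="PartialSearchResults-item-url">maps.google.com </p>
<p class="PartialSearchResults-item-abstract">Find local businesses.</p>
<p class="PartialSearchResults-item-url">www.google.com/drive</p>
HTML

my $p = HTML::TokeParser->new(\$html) or die "Cannot parse HTML";

my @urls;
# For a start tag, get_tag returns [$tag, $attr, $attrseq, $text];
# $attr is a hash reference holding the tag's attributes.
while (my $tag = $p->get_tag("p")) {
    my $class = $tag->[1]{class} // '';
    next unless $class eq 'PartialSearchResults-item-url';
    push @urls, $p->get_trimmed_text("/p");
}

print "$_\n" for @urls;
```

Note that entries such as www.google.com/drive still carry a path, so they would need further trimming if you want the bare host name only.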
Upvotes: 4