Haroon Ahmad

Reputation: 27

How to extract accurate information from multiple tags in Perl

I want to extract URL information from ask.com.

This is the tag:

<p class="PartialSearchResults-item-url">maps.google.com </p>

This is the code I tried, but it extracts junk along with the URLs:

$p = HTML::TokeParser->new(\$rrs);

while ($p->get_tag("p")) {
    my @link = $p->get_trimmed_text("/p");
    foreach (@link) { print "$_\n"; }
    open(OUT, ">>askurls.txt"); print OUT "@link\n"; close(OUT);
}

I only want the domain URLs, like maps.google.com, but it also extracts "Source", "Images" and the text of every other p tag, whatever its class, filling askurls.txt with irrelevant information.

Added:

askurls.txt gets filled with this information:
Videos
Change Settings
OK
Sites Google
Sites Google.com Br
Google
Cookie Policy
assistant.google.com
Meet your Google Assistant. Ask it questions. Tell it to do things. It's your own personal Google, always ready to help whenever you need it.
www.google.com/drive
Safely store and share your photos, videos, files and more in the cloud. Your first 15 GB of storage are free with a Google account.
translate.google.com
Google's free service instantly translates words, phrases, and web pages between English and over 100 other languages.
duo.google.com

Upvotes: 1

Views: 94

Answers (1)

Mobrine Hayde

Reputation: 365

You can use a simple regex that matches only the tags with the class you want:

use strict;
use warnings;

my $text = <<'HTML'; # we are creating example data using a heredoc
<p class="PartialSearchResults-item-url"> maps.google.com </p>
<p class="PartialSearchResults-item-url">example.com</p>
HTML

while ($text =~ m/class="PartialSearchResults-item-url">(.*?)<\/p>/g) { # loop over every match of the regex in $text
  print $1."\n";
}

If you are not sure whether there is whitespace around the domain inside the tag

(as in <p class="PartialSearchResults-item-url">maps.google.com </p>),

you can add \s* on both sides of the capture group:

m/class="PartialSearchResults-item-url">\s*(.*?)\s*<\/p>/g # \s* skips optional whitespace before and after the URL
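
As a minimal sketch of the whitespace-tolerant pattern above, the match loop can be wrapped in a small helper that returns the captured text as a list. The sub name extract_item_urls and the sample HTML are assumptions made for illustration; they are not from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text of every
# <p class="PartialSearchResults-item-url"> element, with the
# surrounding whitespace left out of the capture.
sub extract_item_urls {
    my ($html) = @_;
    my @urls;
    # m{...} avoids escaping the slash in </p>; /s lets the text span lines
    while ($html =~ m{class="PartialSearchResults-item-url">\s*(.*?)\s*</p>}gs) {
        push @urls, $1;
    }
    return @urls;
}

# Sample data (assumed); the second <p> has a different class and is skipped.
my $html = <<'HTML';
<p class="PartialSearchResults-item-url"> maps.google.com </p>
<p class="PartialSearchResults-item-abstract">Some description text</p>
<p class="PartialSearchResults-item-url">example.com</p>
HTML

print "$_\n" for extract_item_urls($html);
```

With this sample input only maps.google.com and example.com are printed, because the pattern is anchored on the class name rather than on the p tag alone.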

And if you also want to check that the captured text is a valid domain, you can use is_domain() from the Data::Validate::Domain module:

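# continuing from the previous script
use Data::Validate::Domain qw(is_domain);

# \s* keeps surrounding whitespace out of $1, which would otherwise fail validation
while ($text =~ m/class="PartialSearchResults-item-url">\s*(.*?)\s*<\/p>/g) {
   if (is_domain($1)) {
      print $1."\n";
   }
}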
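
If you would rather stay with HTML::TokeParser, as in the question, the same filtering can be done by checking the class attribute that get_tag() returns alongside the tag name. This is an alternative to the regex approach, sketched under assumptions: the sub name urls_by_class and the sample HTML are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: collect the text of <p> elements whose class
# attribute equals $class, ignoring every other <p>.
sub urls_by_class {
    my ($html, $class) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @urls;
    while (my $tag = $parser->get_tag('p')) {
        my $attr = $tag->[1];    # hash ref of the start tag's attributes
        next unless defined $attr->{class} && $attr->{class} eq $class;
        # Text up to the closing </p>, with whitespace trimmed
        push @urls, $parser->get_trimmed_text('/p');
    }
    return @urls;
}

# Sample data (assumed); only the first <p> has the wanted class.
my $html = <<'HTML';
<p class="PartialSearchResults-item-url">maps.google.com </p>
<p class="PartialSearchResults-item-abstract">Find local businesses.</p>
HTML

print "$_\n" for urls_by_class($html, 'PartialSearchResults-item-url');
```

When writing the results to a file such as askurls.txt, opening it once before the loop with the three-argument form of open avoids reopening the file for every match.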

Upvotes: 4
