user2676847

Reputation: 181

Parse domains from an HTML page using Perl

I have an HTML page that contains URLs like:

<h3><a href="http://site.com/path/index.php" h="blablabla">
<h3><a href="https://www.site.org/index.php?option=com_content" h="vlavlavla">

I want to extract:

site.com/path
www.site.org

that is, the part between <h3><a href=" and /index.php.

I've tried this code:

#!/usr/local/bin/perl
use strict;
use warnings;

open (MYFILE, 'MyFileName.txt');
while (<MYFILE>) 
{
  my $values1 = split('http://', $_); #VALUE WILL BE: www.site.org/path/index2.php
  my @values2 = split('index.php', $values1); #VALUE WILL BE: www.site.org/path/ ?option=com_content

    print $values2[0]; # here it should print www.site.org/path/ but it doesn't
    print "\n";
}
close (MYFILE);

but this gives the output:

2
1
2
2
1
1

and it doesn't handle https sites either. Hope you understand what I mean. Regards.

Upvotes: 0

Views: 241

Answers (3)

dms

Reputation: 817

The main thing wrong with your code is that when you call split in scalar context as in your line:

my $values1 = split('http://', $_);

It returns the size of the list created by the split, not the fields themselves. See the documentation for split (perldoc -f split).
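For example (a minimal sketch of my own, mirroring the line from the question), compare what split returns in scalar and in list context:

my $line = 'http://www.site.org/path/index.php';

# scalar context: split returns the NUMBER of fields, not a field
my $count = split('http://', $line);    # 2 -- the mysterious numbers in your output

# list context: split returns the fields themselves
my @fields = split('http://', $line);   # ('', 'www.site.org/path/index.php')

print "$count\n";       # prints 2
print "$fields[1]\n";   # prints www.site.org/path/index.php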

But I don't think split is appropriate for this task anyway. If you know that the value you are looking for will always lie between 'http[s]://' and '/index.php', you just need a regex substitution in your loop (you should also be more careful when opening your file...):

open(my $myfile_fh, '<', 'MyFileName.txt') or die "Couldn't open $!";
while(<$myfile_fh>) {
    s{.*http[s]?://(.*)/index\.php.*}{$1} && print;
}

close($myfile_fh);

It's likely you will need a more general regex than that, but I think this would work based on your description of the problem.

Upvotes: 2

justintime

Reputation: 3631

This feels to me like a job for modules

Generally using regexps to parse HTML is risky.
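For instance, here is a minimal sketch of such a module-based approach (my own illustration, not part of the original answer), using HTML::LinkExtor (from the HTML-Parser distribution) to pull the links out of the markup and URI to take them apart. The file name and the /index.php rule are assumptions carried over from the question:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;
use URI;

my $file = shift(@ARGV) // 'MyFileName.txt';   # HTML file, as in the question

# collect every href found in an <a> tag
my @urls;
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @urls, $attr{href} if $tag eq 'a' && defined $attr{href};
});
$parser->parse_file($file);

# print host plus the path leading up to /index.php, as asked for in the question
for my $url (@urls) {
    my $uri = URI->new($url);
    next unless $uri->scheme && $uri->scheme =~ /\Ahttps?\z/;
    if (my ($path) = $uri->path =~ m{\A(.*)/index\.php}) {
        print $uri->host, $path, "\n";   # e.g. site.com/path or www.site.org
    }
}

This sidesteps the fragile parsing entirely: the HTML parser finds the links, and URI handles the scheme and host for both http and https.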

Upvotes: 1

amon

Reputation: 57620

dms explained in his answer why using split isn't the best solution here:

  • It returns the number of items in scalar context
  • A normal regex is better suited for this task.

However, I do not think that line-based processing of the input is valid for HTML, or that using a substitution makes sense (it does not, especially when the pattern looks like .*Pattern.*).

Given a URL, we can extract the required information like this:

if ($url =~ m{^https?://(.+?)/index\.php}s) {  # domain+path now in $1
  say $1;
}

But how do we extract the URLs? I'd recommend the wonderful Mojolicious suite.

use strict; use warnings;
use feature 'say';
use File::Slurp 'slurp';  # makes it easy to read files.
use Mojo::DOM;

my $html_file = shift @ARGV;  # take file name from command line

my $dom = Mojo::DOM->new(scalar slurp $html_file);

for my $link ($dom->find('a[href]')->each) {
  say $1 if $link->attr('href') =~ m{^https?://(.+?)/index\.php}s;
}

The find method takes CSS selectors (here: all a elements that have an href attribute). The each method flattens the result set into a list that we can loop over.

Since the script prints to STDOUT, we can use shell redirection to put the output into whatever file we want, e.g.

$ perl the-script.pl html-with-links.html >only-links.txt

The whole script as a one-liner:

$ perl -Mojo -E'$_->attr("href") =~ m{^https?://(.+?)/index\.php}s and say $1 for x(b("test.html")->slurp)->find("a[href]")->each'

Upvotes: 0
