Jake Fallin

Reputation: 11

HTML Parsing using HTML::Tree in Perl

I'm trying to scrape a webpage and get the values that are inside the html tags. The end result would be a way to separate values in a way that looks like Club: x Location: y URL: z

Here's what I have so far

use HTML::Tree;
use LWP::Simple;

$url = "http://home.gotsoccer.com/clubs.aspx?&clubname=&clubstate=AL&clubcity=";
$content = get($url);
$tree = HTML::Tree->new();
$tree->parse($content);
@td = $tree->look_down( _tag => 'td', class => 'ClubRow');
foreach $1 (@td) {
    print $1->as_text();
    print "\n";
}

What gets printed looks like this:

AYSO UnitedMadison, ALwww.aysounitednorthalabama.org

This is what the HTML looks like

<td class="ClubRow" width="80%">
   <div>
       <a href="/rankings/club.aspx?ClubID=27086" class="ClubLink">AYSO United</a></div>
   <div class="SubHeading">Madison, AL</div>
       <a href="http://www.aysounitednorthalabama.org" target="_blank"><img src="/images/icons/ArrowRightSm.png" class="LinkIcon"><font color="black">www.aysounitednorthalabama.org</font></a>
</td>

I need a way to either split these fields into separate variables or add some sort of delimiting character so I can do it with Regex. There isn't much documentation online, so any help would be appreciated.

Upvotes: 0

Views: 322

Answers (2)

brian d foy

Reputation: 132822

Here's a Mojolicious example. It's the same thing that Sinan did but with a different toolbox which has the tools to fetch and process the webpage. It looks a bit long, but that's just the comments and documentation. ;)

I like that Mojolicious is "batteries included", so once I load one of the modules, I probably have everything else I need for the task:

use v5.14;  # the s///r flag used below needs Perl 5.14 or later
use Mojo::UserAgent;

my $url = "http://home.gotsoccer.com/clubs.aspx?&clubname=&clubstate=AL&clubcity=";
my $ua = Mojo::UserAgent->new;

my $tx = $ua->get( $url );

    # You could do some error checking here in case the fetch fails
$tx->res->dom
        # there are lots of ClubRow td cells, but we want the one with
        # the width attribute. Find all of those. See Mojo::DOM::CSS for 
        # docs on CSS selectors.
    ->find( 'td[class=ClubRow][width=80%]' )
        # now go through each td and extract several things
    ->map( sub {
            # these selectors represent the club name, location, and website
        state $find = [ qw(
            a[class=ClubLink]
            div[class=SubHeading]
            font[color=black]
            ) ];
        my $chunk = $_;

            # return the name, location, and website as a tuple for later
            # processing
        [
            map { s/\t+/ /gr } # remove tabs so we can use them as a separator
            map { $chunk->find( $_ )->map( 'text' )->[0] }
            @$find
        ]
        } )
        # do something with all tuples. In this case, output them as tab
        # separated values (which is why you removed tabs already). You
        # should be able to easily import this into a spreadsheet application.
    ->each( sub { say join "\t", @$_ } );

The output has that annoying first line, but you can fix that up on your own:

*****Other Club*****
Alabama Soccer Association      www.alsoccer.org
Alabaster Competitive SC        acsc.teampages.com/
Alabaster Parks and Rec
Alex City YSL       www.alexcitysoccer.com/
Auburn Thunder SC       auburnthundersoccer.com/
AYSO United Madison, AL www.aysounitednorthalabama.org
Birmingham Area Adult Soccer League
Birmingham Bundesliga LLC   Birmingham, AL  www.birmingham7v7.com
Birmingham Premier League
Birmingham United SA    Birmingham, AL, AL  www.birminghamunited.com/
Blount County Youth Soccer  Oneonta, AL bcysfury.com
Briarwood SC    Birmingham, AL  www.questrecreation.org/briarwood-soccer-club.html...
Capital City Streaks    Montgomery, AL  www.capitalcitystreaks.org
City of Calera Youth Soccer
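One way to drop that first line is to filter the tuples before printing them. This is only a sketch: I don't know what the markup of the "Other Club" row really looks like, so the header row in the sample HTML below is made up, and the rule "skip any row whose name starts with an asterisk" is my assumption about what counts as a header. It parses an inline string with Mojo::DOM instead of fetching the page so that it runs offline.

```perl
use v5.14;
use Mojo::DOM;

# Sample markup. The first row is a HYPOTHETICAL stand-in for the
# "*****Other Club*****" header; the real page may mark it up differently.
my $html = <<'HTML';
<table>
<tr><td class="ClubRow" width="80%"><div><a class="ClubLink">*****Other Club*****</a></div></td></tr>
<tr><td class="ClubRow" width="80%">
  <div><a href="/rankings/club.aspx?ClubID=27086" class="ClubLink">AYSO United</a></div>
  <div class="SubHeading">Madison, AL</div>
  <a href="http://www.aysounitednorthalabama.org" target="_blank"><font color="black">www.aysounitednorthalabama.org</font></a>
</td></tr>
</table>
HTML

# Club name, location, and website, in that order
my @selectors = ( 'a[class=ClubLink]', 'div[class=SubHeading]', 'font[color=black]' );

my @rows = Mojo::DOM->new( $html )
    ->find( 'td[class=ClubRow][width="80%"]' )
    ->map( sub {
        my $td = $_;
        # "at" returns the first match or undef, so a missing field
        # becomes an empty string
        [ map { my $e = $td->at( $_ ); $e ? $e->text : '' } @selectors ]
        } )
        # assumed rule: a header row has a name that starts with an asterisk
    ->grep( sub { $_->[0] !~ /^\*/ } )
    ->each;

say join "\t", @$_ for @rows;
```

The same grep call should slot into the original chain between map and each, as long as the header text really does start with asterisks.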

Upvotes: 1

Sinan Ünür

Reputation: 118138

First, this is an abomination:

foreach $1 (@td) {
    print $1->as_text();
    print "\n";
}

You might think it is cute, but using a regex capture variable such as $1 as a loop variable is confusing, especially since you also say "I need a way... so I can do it with Regex." (emphasis mine)

This is the kind of nonsense that leads to unmaintainable programs which give Perl a bad name.

Always use strict and warnings and use a plain variable for your loops.

Second, you are interested in three specific elements in each td: 1) The text of a[class="ClubLink"]; 2) The text of div[class="SubHeading"]; and 3) The text of font[color="black"].

So, just extract those three bits of information instead of flattening the text inside a td:

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::Tree;

my $html = <<HTML;
<td class="ClubRow" width="80%"> <div> <a
href="/rankings/club.aspx?ClubID=27086" class="ClubLink">AYSO United</a></div>
<div class="SubHeading">Madison, AL</div> <a
href="http://www.aysounitednorthalabama.org" target="_blank"><img
src="/images/icons/ArrowRightSm.png" class="LinkIcon"><font
color="black">www.aysounitednorthalabama.org</font></a> </td>
HTML

my $tree = HTML::Tree->new_from_content( $html );

my @wanted = (
    [class => 'ClubLink'],
    [class => 'SubHeading'],
    [_tag => 'font', color => 'black'],
);

my @td = $tree->look_down( _tag => 'td', class => 'ClubRow');

for my $td ( @td ) {
    my ($club, $loc, $www) = map $td->look_down(@$_)->as_text, @wanted;
    print join(' - ', $club, $loc, $www), "\n";
}

Output:

$ ./gg.pl
AYSO United - Madison, AL - www.aysounitednorthalabama.org

Of course, I would have probably used HTML::TreeBuilder::XPath to take advantage of XPath queries.
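For comparison, here is a rough sketch of what an XPath version might look like. It assumes HTML::TreeBuilder::XPath is installed from CPAN. The three XPath expressions are my own translation of the look_down criteria above, and I wrapped the sample td in a table and tr so the fragment parses as a normal table cell.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# Same sample cell as above, wrapped in <table><tr> (my addition)
my $html = <<'HTML';
<table><tr>
<td class="ClubRow" width="80%">
  <div><a href="/rankings/club.aspx?ClubID=27086" class="ClubLink">AYSO United</a></div>
  <div class="SubHeading">Madison, AL</div>
  <a href="http://www.aysounitednorthalabama.org" target="_blank"><img src="/images/icons/ArrowRightSm.png" class="LinkIcon"><font color="black">www.aysounitednorthalabama.org</font></a>
</td>
</tr></table>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content( $html );

# One relative XPath expression per field: club, location, website
my @wanted = (
    './/a[@class="ClubLink"]',
    './/div[@class="SubHeading"]',
    './/font[@color="black"]',
);

my @rows;
for my $td ( $tree->findnodes( '//td[@class="ClubRow"]' ) ) {
    # findvalue gives the text of the matching nodes as a plain string
    my ($club, $loc, $www) = map { $td->findvalue( $_ ) } @wanted;
    push @rows, join( ' - ', $club, $loc, $www );
}

print "$_\n" for @rows;

$tree->delete;    # free the parse tree
```

A nice side effect is that findvalue hands back a string rather than an element, so a cell that lacks one of the three elements should not blow up the way calling as_text on an undefined look_down result would.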

Upvotes: 3
