Reputation: 11
I'm trying to scrape a webpage and extract the values inside its HTML tags. The end result should separate the values so that they look like Club: x Location: y URL: z
Here's what I have so far:
use HTML::Tree;
use LWP::Simple;
$url = "http://home.gotsoccer.com/clubs.aspx?&clubname=&clubstate=AL&clubcity=";
$content = get($url);
$tree = HTML::Tree->new();
$tree->parse($content);
@td = $tree->look_down( _tag => 'td', class => 'ClubRow');
foreach $1 (@td) {
print $1->as_text();
print "\n";
}
And what is printed looks like this:
AYSO UnitedMadison, ALwww.aysounitednorthalabama.org
This is what the HTML looks like:
<td class="ClubRow" width="80%">
<div>
<a href="/rankings/club.aspx?ClubID=27086" class="ClubLink">AYSO United</a></div>
<div class="SubHeading">Madison, AL</div>
<a href="http://www.aysounitednorthalabama.org" target="_blank"><img src="/images/icons/ArrowRightSm.png" class="LinkIcon"><font color="black">www.aysounitednorthalabama.org</font></a>
</td>
I need a way to either split these fields into separate variables or add some sort of delimiting character so I can split them with a regex. There isn't much documentation online, so any help would be appreciated.
Upvotes: 0
Views: 322
Reputation: 132822
Here's a Mojolicious example. It's the same thing that Sinan did but with a different toolbox which has the tools to fetch and process the webpage. It looks a bit long, but that's just the comments and documentation. ;)
I like that Mojolicious is "batteries included", so once I load one of the modules, I probably have everything else I need for the task:
use v5.14; # say and state need 5.10; the s///r used below needs 5.14
use Mojo::UserAgent;
my $url = "http://home.gotsoccer.com/clubs.aspx?&clubname=&clubstate=AL&clubcity=";
my $ua = Mojo::UserAgent->new;
my $tx = $ua->get( $url );
# You could do some error checking here in case the fetch fails
$tx->res->dom
# there are lots of ClubRow td cells, but we want the one with
# the width attribute. Find all of those. See Mojo::DOM::CSS for
# docs on CSS selectors.
->find( 'td[class=ClubRow][width=80%]' )
# now go through each td and extract several things
->map( sub {
# these selectors represent the club name, location, and website
state $find = [ qw(
a[class=ClubLink]
div[class=SubHeading]
font[color=black]
) ];
my $chunk = $_;
# return the name, location, and website as a tuple for later
# processing
[
map { s/\t+/ /gr } # remove tabs so we can use them as a separator
map { $chunk->find( $_ )->map( 'text' )->[0] }
@$find
]
} )
# do something with all tuples. In this case, output them as tab
# separated values (which is why the tabs were removed already). You
# should be able to easily import this into a spreadsheet application.
->each( sub { say join "\t", @$_ } );
The output has that annoying first line, but you can fix that up on your own:
*****Other Club*****
Alabama Soccer Association www.alsoccer.org
Alabaster Competitive SC acsc.teampages.com/
Alabaster Parks and Rec
Alex City YSL www.alexcitysoccer.com/
Auburn Thunder SC auburnthundersoccer.com/
AYSO United Madison, AL www.aysounitednorthalabama.org
Birmingham Area Adult Soccer League
Birmingham Bundesliga LLC Birmingham, AL www.birmingham7v7.com
Birmingham Premier League
Birmingham United SA Birmingham, AL, AL www.birminghamunited.com/
Blount County Youth Soccer Oneonta, AL bcysfury.com
Briarwood SC Birmingham, AL www.questrecreation.org/briarwood-soccer-club.html...
Capital City Streaks Montgomery, AL www.capitalcitystreaks.org
City of Calera Youth Soccer
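If you'd rather not fix that up by hand, here is a rough sketch of one way to drop the header row and to cope with clubs that have no location or website. It works on a small inline sample instead of fetching the page; the sample markup (in particular, the assumption that the "Other Club" row uses the same ClubLink anchor) is made up for illustration and not taken from the real page.

```perl
use v5.14;
use Mojo::DOM;

# Hypothetical sample markup standing in for the fetched page
my $html = <<'HTML';
<table>
<tr><td class="ClubRow" width="80%"><div><a class="ClubLink">*****Other Club*****</a></div></td></tr>
<tr><td class="ClubRow" width="80%">
  <div><a href="/rankings/club.aspx?ClubID=27086" class="ClubLink">AYSO United</a></div>
  <div class="SubHeading">Madison, AL</div>
  <a href="http://www.aysounitednorthalabama.org" target="_blank"><font color="black">www.aysounitednorthalabama.org</font></a>
</td></tr>
<tr><td class="ClubRow" width="80%"><div><a class="ClubLink">Alabaster Parks and Rec</a></div></td></tr>
</table>
HTML

# club name, location, and website, in that order
my @selectors = ( 'a.ClubLink', 'div.SubHeading', 'font[color=black]' );

my @rows = Mojo::DOM->new( $html )
    ->find( 'td.ClubRow' )
    ->map( sub {
        my $td = $_;
        # at() returns undef when nothing matches, so fall back to ''
        [ map { my $e = $td->at( $_ ); $e ? $e->text =~ s/\s+/ /gr : '' } @selectors ]
    } )
    # drop the row whose name starts with asterisks
    ->grep( sub { $_->[0] !~ /^\*/ } )
    ->each;

say join "\t", @$_ for @rows;
```

The grep on a leading asterisk is only a guess at what distinguishes the header row; adjust the test to whatever the real page gives you.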
Upvotes: 1
Reputation: 118138
First, this is an abomination:
foreach $1 (@td) {
print $1->as_text();
print "\n";
}
You might think it is cute, but it is confusing to use regex capture variables such as $1 as a loop variable, especially since you also say "I need a way... so I can do it with Regex." (emphasis mine)
This is the kind of nonsense that leads to unmaintainable programs which give Perl a bad name.
Always use strict and warnings and use a plain variable for your loops.
Second, you are interested in three specific elements in each td: 1) the text of a[class="ClubLink"]; 2) the text of div[class="SubHeading"]; and 3) the text of font[color="black"].
So, just extract those three bits of information instead of flattening the text inside a td:
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Tree;
my $html = <<HTML;
<td class="ClubRow" width="80%"> <div> <a
href="/rankings/club.aspx?ClubID=27086" class="ClubLink">AYSO United</a></div>
<div class="SubHeading">Madison, AL</div> <a
href="http://www.aysounitednorthalabama.org" target="_blank"><img
src="/images/icons/ArrowRightSm.png" class="LinkIcon"><font
color="black">www.aysounitednorthalabama.org</font></a> </td>
HTML
my $tree = HTML::Tree->new_from_content( $html );
my @wanted = (
[class => 'ClubLink'],
[class => 'SubHeading'],
[_tag => 'font', color => 'black'],
);
my @td = $tree->look_down( _tag => 'td', class => 'ClubRow');
for my $td ( @td ) {
my ($club, $loc, $www) = map $td->look_down(@$_)->as_text, @wanted;
print join(' - ', $club, $loc, $www), "\n";
}
Output:
$ ./gg.pl
AYSO United - Madison, AL - www.aysounitednorthalabama.org
Of course, I would have probably used HTML::TreeBuilder::XPath to take advantage of XPath queries.
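For comparison, here is a minimal sketch of what the XPath variant might look like. It assumes HTML::TreeBuilder::XPath is installed and wraps the same sample cell in a table so the fragment parses as a normal row; the XPath expressions are my own translation of the look_down criteria above. As far as I know, findvalue returns an empty string when nothing matches, so cells without a location or website should not blow up the way a chained look_down(...)->as_text would.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Same sample cell as above, wrapped in a table for parsing
my $html = <<'HTML';
<table><tr>
<td class="ClubRow" width="80%"> <div> <a
href="/rankings/club.aspx?ClubID=27086" class="ClubLink">AYSO United</a></div>
<div class="SubHeading">Madison, AL</div> <a
href="http://www.aysounitednorthalabama.org" target="_blank"><img
src="/images/icons/ArrowRightSm.png" class="LinkIcon"><font
color="black">www.aysounitednorthalabama.org</font></a> </td>
</tr></table>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content( $html );

# Relative XPath queries for club name, location, and website
my @wanted = (
    './/a[@class="ClubLink"]',
    './/div[@class="SubHeading"]',
    './/font[@color="black"]',
);

my @rows;
for my $td ( $tree->findnodes( '//td[@class="ClubRow"]' ) ) {
    # findvalue gives the text of the matching node(s) as a string
    push @rows, [ map { $td->findvalue( $_ ) } @wanted ];
}

print join(' - ', @$_), "\n" for @rows;

$tree->delete;
```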
Upvotes: 3