Reputation: 960

How do I extract an HTML element based on its class?

I'm just starting out in Perl, and wrote a simple script to do some web scraping. I'm using WWW::Mechanize and HTML::TreeBuilder to do most of the work, but I've run into some trouble. I have the following HTML:

<table class="winsTable">
    <thead>...</thead>
    <tbody>
        <tr>
            <td class = "wins">15</td>
        </tr>
    </tbody>
</table>

I know there are some modules that get data from tables, but this is a special case; not all the data I want is in a table. So, I tried:

my $tree = HTML::TreeBuilder->new_from_url( $url );
my @data = $tree->find('td class = "wins"');

But @data returned empty. I know this method would work without the class name, because I've successfully parsed data with $tree->find('strong'). So, is there a module that can handle this type of HTML syntax? I scanned through the HTML::TreeBuilder documentation and didn't find anything that appeared to, but I could be wrong.

Upvotes: 10

Answers (4)

Anna

Reputation: 121

I found this link the most useful for showing how to extract specific information from HTML content. I used the last example on the page:

use v5.10;
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;

my $mech = WWW::Mechanize->new;
WWW::Mechanize::TreeBuilder->meta->apply($mech);

$mech->get( 'http://htmlparsing.com/' );

# Find all <h1> tags
my @list = $mech->find('h1');

# or this way <----- I found this very useful for pinpointing exact classes within some HTML
# (@list is already declared above, so it is reassigned here rather than redeclared with "my")
@list = $mech->look_down('_tag'  => 'h1',
                         'class' => 'main_title');

# Now just iterate and process
foreach (@list) {
    say $_->as_text();
}

This seemed so much simpler to get up and running than any of the other modules that I looked at. Hope this helps!

Upvotes: 0

doubleDown

Reputation: 8408

(This is kind of a supplementary answer to dspain's)

Actually you missed a spot in the HTML::TreeBuilder documentation where it says,

Objects of this class inherit the methods of both HTML::Parser and HTML::Element. The methods inherited from HTML::Parser are used for building the HTML tree, and the methods inherited from HTML::Element are what you use to scrutinize the tree. Besides this (HTML::TreeBuilder) documentation, you must also carefully read the HTML::Element documentation, and also skim the HTML::Parser documentation -- probably only its parse and parse_file methods are of interest.

(Note that the bold formatting is mine, it's not in the documentation)

This indicates that you should read HTML::Element's documentation as well, where you would find the find method, whose description says

This is just an alias to find_by_tag_name

This should tell you that it doesn't work for class names, but its description also mentions a look_down method which can be found slightly further down. If you look at the example, you'd see that it does what you want. And dspain's answer shows precisely how in your case.

To be fair, the documentation is not that easy to navigate.

Upvotes: 1

dms

Reputation: 817

You could use the look_down method to find the specific tag and attributes you're looking for. It is defined in the HTML::Element module (whose methods HTML::TreeBuilder inherits).

my $data = $tree->look_down(
    _tag  => 'td',
    class => 'wins'
);

print $data->content_list, "\n" if $data; #prints '15' using the given HTML

$data = $tree->look_down(
    _tag  => 'td',
    class => 'losses'
);

print $data->content_list, "\n" if $data; #prints nothing using the given HTML
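
If more than one cell can carry the class, the same call in list context returns every match. Here is a minimal, self-contained sketch of that; the HTML string (including the second row) is made up for illustration, and it parses a string with new_from_content instead of fetching a URL:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative HTML only: the second row is an assumption, not from the question.
my $html = <<'HTML';
<table class="winsTable">
    <tbody>
        <tr><td class="wins">15</td></tr>
        <tr><td class="wins">22</td></tr>
    </tbody>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In list context, look_down returns all matching elements, not just the first.
my @cells = $tree->look_down( _tag => 'td', class => 'wins' );

# as_text gives the text content of each matched element.
my @wins = map { $_->as_text } @cells;

print "$_\n" for @wins;

# Free the tree explicitly once the values have been copied out.
$tree->delete;
```

In scalar context (as in the snippet above this one) look_down returns only the first match or undef, which is why the "if $data" guard is needed there.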

Upvotes: 10

gangabass

Reputation: 10666

I'm using the excellent (but sometimes a bit slow) HTML::TreeBuilder::XPath module:

use HTML::TreeBuilder::XPath;

# $mech is a WWW::Mechanize object that has already fetched the page
my $tree = HTML::TreeBuilder::XPath->new_from_content( $mech->content() );
my @data = $tree->findvalues('//table[ @class = "winsTable" ]//td[@class = "wins"]');
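
Since the question mentions that not all the data lives in a table, here is a rough, self-contained sketch of how the XPath expression can be narrowed or widened. The HTML string, including the extra paragraph, is made up for illustration, and it is parsed from a string rather than from $mech:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Illustrative HTML only: the <p> element is an assumption, not from the question.
my $html = <<'HTML';
<table class="winsTable">
    <tbody>
        <tr><td class="wins">15</td></tr>
    </tbody>
</table>
<p class="wins">not in the table</p>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Only td cells with class "wins" inside the table with class "winsTable".
my @in_table = $tree->findvalues('//table[@class = "winsTable"]//td[@class = "wins"]');

# Any element whose class attribute is exactly "wins", inside a table or not.
my @anywhere = $tree->findvalues('//*[@class = "wins"]');

print "in table: $_\n" for @in_table;
print "anywhere: $_\n" for @anywhere;

$tree->delete;
```

Note that @class = "wins" is an exact string comparison, so an element with several classes (for example class="wins highlight") would not match either expression.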

Upvotes: 7
