Perl Screen Scrape Data from Table

Question

I would like to write a Perl script that fetches the HTML of a web page and then scrapes the contents of a table. The exact page is:

http://djbpmstudio.com/Default.aspx?Page=album&id=1

So far I am able to extract the Artist, Album, and Genre, as well as the first entry in the table, with regular expressions, using the code below:

use LWP::Simple;

$url = "http://djbpmstudio.com/Default.aspx?Page=album&id=1";
my $mystring = get($url) or die "Error fetching source page.";
$mystring =~ s/[\r\n]/ /g;      #remove line breaks from HTML
$mystring =~ s/(>)\s+(<)/$1$2/g;    #Remove white space between html tags 
#print $mystring;

if($mystring =~ m{(.*?) - (.*?) - (.*?)}) {
    #Get Artist name and print
    print "Artist: $1\n";
    print "Album:  $2\n";
    print "Genre:  $3\n\n";

    if($mystring =~ m{(.*?)(.*?)}) {
        #Get Songname and BPM and print
        #print "$1\t";
        print "$2\t";
        print "$3\n";
    }
}

In the nested if, the row class alternates between "row-a" and "row-b".

I am not sure how to walk down the list and get the song name and BPM for each row. I would also like to store the song names and BPMs in an array for later processing.

Thank you.

tadmc · Accepted Answer

Using regular expressions to process HTML is nearly always a bad idea.

Don't be bad.

Use a module that understands HTML data for processing HTML data.

#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;

my $html = get('http://djbpmstudio.com/Default.aspx?Page=album&id=1')
    or die "Error fetching source page.\n";   # get() returns undef on failure

my $te = HTML::TableExtract->new( headers => ['Track Name', 'BPM'] );
$te->parse($html);
foreach my $ts ($te->table_states) {
   foreach my $row ($ts->rows) {
       next unless $row->[0] =~ /\w/;   # skip garbage rows
       printf "%-20s   ==>   %.2f\n", $row->[0], $row->[1];
   }
}
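
The question also asks for the song names and BPMs to end up in an array for later processing. What follows is only a minimal sketch of one way to do that, not a tested solution for the live page. It assumes the table has 'Track Name' and 'BPM' header cells as above. The inline $html sample, the @tracks array and its name/bpm keys are hypothetical names made up for illustration. The sample markup stands in for the fetched page so the snippet runs without network access. It uses the tables method documented for HTML::TableExtract 2.x.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample markup standing in for the fetched page;
# the real page's markup and values may differ.
my $html = <<'HTML';
<table>
  <tr><th>Track Name</th><th>BPM</th></tr>
  <tr class="row-a"><td>Example Song One</td><td>120.5</td></tr>
  <tr class="row-b"><td>Example Song Two</td><td>98</td></tr>
  <tr class="row-a"><td> </td><td> </td></tr>
</table>
HTML

my $te = HTML::TableExtract->new( headers => ['Track Name', 'BPM'] );
$te->parse($html);

# Collect each data row as a hash reference instead of printing it.
my @tracks;
foreach my $table ($te->tables) {
    foreach my $row ($table->rows) {
        my ($name, $bpm) = @$row;
        # Skip rows with a missing cell or no word characters in the name.
        next unless defined $name && defined $bpm && $name =~ /\w/;
        s/^\s+|\s+$//g for $name, $bpm;   # trim surrounding whitespace
        push @tracks, { name => $name, bpm => $bpm };
    }
}

# Later processing, for example listing the tracks from slowest to fastest.
foreach my $track (sort { $a->{bpm} <=> $b->{bpm} } @tracks) {
    printf "%-20s   ==>   %.2f\n", $track->{name}, $track->{bpm};
}
```

Storing one hash reference per row keeps each song name paired with its BPM, so the list can be sorted or filtered later without parsing the HTML again.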
