Arjun Nayini

Reputation: 191

Web::Scraper and Perl

I have the following script that scrapes my school's CS department site to get a list of all the courses. I want to extract the CRN (course number) and other important information and put it into a database that users can browse through a web app.

Here is an example URL: http://courses.illinois.edu/cis/2011/spring/schedule/CS/411.html

I would like to extract info from pages like this. The first level of the scraper just constructs the individual page URLs from a list of all of the courses. Once I'm at a course-specific catalog page, I use the second scraper to try to get all of the info I want. For some reason, although the CRNs and course instructors are all 'td' elements, my scraper seems to return nothing. When I scrape for 'div' instead, I get a bunch of info for each relevant page. So somehow I'm failing to get the 'td' elements, even though I'm scraping from the right page.

my $tweets = scraper {
    # Parse all LIs with the class "status", store them into a resulting
    # array 'tweets'.  We embed another scraper for each tweet.
    # process "h4.ws-ds-name.detail-title", "array[]" => 'TEXT';
    process "div.ws-row", "array[]" => 'TEXT';
};

my $res = $tweets->scrape( URI->new("http://courses.illinois.edu/cis/2011/spring/schedule/CS/index.html?skinId=2169") );

foreach my $elem (@{$res->{array}}) {
    my $coursenum = substr($elem,2,4);

    my $secondLevel = scraper {
        process "td.ws-row", "array2[]" => 'TEXT';
    };

    my $res2 = $secondLevel->scrape(URI->new("http://courses.illinois.edu/cis/2011/spring/schedule/CS/$coursenum.html"));
    my $num = @{$res2->{array2}};
    print $num;

    print "---------------------", "\n";
    my @curr = @{$res2->{array2}};
    foreach my $elem2 (@curr) {
        $num++;
        print $elem2, "    ", "\n";
    }
    print "---------------------", "\n";
}

Any ideas?

Thanks

Upvotes: 0

Views: 1167

Answers (3)

bvr

Reputation: 9697

I played with your problem a bit. You can get the course id, title, and link to each individual course page within the initial scraper:

my $courses = scraper {
    process 'div.ws-row',
        'course[]' => scraper {
            process 'div.ws-course-number',  'id'    => 'TEXT';
            process 'div.ws-course-title',   'title' => 'TEXT';
            process 'div.ws-course-title a', 'link'  => '@href';
        };
    result 'course';
};

The result of scraping is an arrayref of hashrefs like this:

{   id    => "CS 103",
    title => "Introduction to Programming",
    link  => bless(do{\(my $o = "http://courses.illinois.edu/cis/2011/spring/schedule/CS/103.html?skinId=2169")}, "URI::http"),
},
....

Then you can do additional scraping for each course from its individual page and add that information to the original structure:

for my $course (@$res) {
    my $crs_scraper = scraper {
        process 'div.ws-description', 'desc' => 'TEXT';
        # ... add more items here
    };
    my $additional_data = $crs_scraper->scrape(URI->new($course->{link}));

    # slice assignment to add them into course definition
    @{$course}{ keys %$additional_data } = values %$additional_data;
}

The complete source, combined, is as follows:

use strict; use warnings;
use URI;
use Web::Scraper;
use Data::Dump qw(dump);

my $url = 'http://courses.illinois.edu/cis/2011/spring/schedule/CS/index.html?skinId=2169';

my $courses = scraper {
    process 'div.ws-row',
        'course[]' => scraper {
            process 'div.ws-course-number',  'id'    => 'TEXT';
            process 'div.ws-course-title',   'title' => 'TEXT';
            process 'div.ws-course-title a', 'link'  => '@href';
        };
    result 'course';
};

my $res = $courses->scrape(URI->new($url));

for my $course (@$res) {
    my $crs_scraper = scraper {
        process 'div.ws-description', 'desc' => 'TEXT';
        # ... add more items here
    };
    my $additional_data = $crs_scraper->scrape(URI->new($course->{link}));

    # slice assignment to add them into course definition
    @{$course}{ keys %$additional_data } = values %$additional_data;
}

dump $res;
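
The slice assignment used in the loop can be tried in isolation, without any network access. The following is a minimal sketch using only core Perl; the `desc` and `credit` values are made-up placeholders, not data from the course pages. It relies on `keys` and `values` returning their lists in matching order as long as the hash is not modified in between.

```perl
use strict;
use warnings;

# Hypothetical course record, shaped like one element of the scraped list.
my $course = { id => 'CS 103', title => 'Introduction to Programming' };

# Hypothetical result of the second, per-course scrape.
my $additional_data = { desc => 'Placeholder description', credit => '3 hours' };

# Hash slice assignment: copy every key/value pair of $additional_data
# into $course. Keys that already exist in $course would be overwritten.
@{$course}{ keys %$additional_data } = values %$additional_data;

print "$_ => $course->{$_}\n" for sort keys %$course;
```

After the assignment, `$course` holds the original `id` and `title` plus the two merged keys.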

Upvotes: 1

snoofkin

Reputation: 8895

The easiest way to go in this case is to use HTML::TableExtract, if you are only looking for data from the table.
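
As a rough illustration of that approach, here is a minimal sketch that runs HTML::TableExtract over an inline HTML string. The table markup, the header names (`CRN`, `Type`, `Instructor`), and the row values are all assumptions made up for the example; the real course pages may be laid out differently, so the `headers` list would need adjusting.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up HTML standing in for a course page (assumed structure and data).
my $html = <<'HTML';
<table>
  <tr><th>CRN</th><th>Type</th><th>Instructor</th></tr>
  <tr><td>12345</td><td>LEC</td><td>Doe, J</td></tr>
  <tr><td>67890</td><td>DIS</td><td>Roe, R</td></tr>
</table>
HTML

# Select only the columns whose header text matches these names.
my $te = HTML::TableExtract->new( headers => [ 'CRN', 'Instructor' ] );
$te->parse($html);
$te->eof;

my @rows;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        push @rows, [@$row];    # copy each row: [ CRN, Instructor ]
        print join( ' | ', map { $_ // '' } @$row ), "\n";
    }
}
```

To use it on a live page, the HTML would first have to be fetched (for example with LWP::UserAgent) and passed to `parse` in place of `$html`.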

Upvotes: 1

ysth

Reputation: 98398

It looks to me like

my $coursenum = substr($elem,2,4);

should be

my $coursenum = substr($elem,3,3);
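
To see why the offsets matter, here is a small sketch in core Perl. It assumes the row text starts with something like "CS 103" with no leading whitespace; the actual text returned for 'div.ws-row' may differ. The regex variant at the end is an extra suggestion, not part of the fix above.

```perl
use strict;
use warnings;

# Assumed row text; the real scraped string may have different spacing.
my $elem = 'CS 103 Introduction to Programming';

# Offset 2 is the space after "CS", so four characters give " 103".
my $original = substr( $elem, 2, 4 );

# Offset 3 is the first digit, so three characters give "103".
my $corrected = substr( $elem, 3, 3 );

# Alternative that does not depend on fixed offsets: first 3-digit number.
my ($by_regex) = $elem =~ /\b(\d{3})\b/;

print "[$original] [$corrected] [$by_regex]\n";
```

Under that assumption, the original call yields a course number with a leading space, so the constructed URL would contain "CS/ 103.html" and would not point at the course page, which would explain why no 'td' elements are found.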

Upvotes: 1
