Reputation: 191
I have the following script that scrapes my schools CS department to get a list of all the courses. I want to be able to extract the CRN (course number) and other important information to put into a database which I can let users browse through a web app.
Here is an example URL: http://courses.illinois.edu/cis/2011/spring/schedule/CS/411.html
I would like to extract info from pages like this. The first level of the scraper just constructs the individual sites from a list of all of the courses. Once I'm at a course specific catalog page, I use the second scraper to attempt to get all of this info i want. For some reason, although the CRN's and Course Instructors are all 'td' elements. My scraper seems to be returning nothing when scraping. I tried to scrape specifically for 'div' instead and I get a bunch of info for each relevant page. So somehow I'm failing to get the 'td' element, but I'm scraping from the right page.
my $tweets = scraper {
# Parse all LIs with the class "status", store them into a resulting
# array 'tweets'. We embed another scraper for each tweet.
# process "h4.ws-ds-name.detail-title", "array[]" => 'TEXT';
process "div.ws-row", "array[]" => 'TEXT';
};
my $res = $tweets->scrape( URI- >new("http://courses.illinois.edu/cis/2011/spring/schedule/CS/index.html?skinId=2169") );
foreach my $elem (@{$res->{array}}){
my $coursenum = substr($elem,2,4);
my $secondLevel = scraper{
process "td.ws-row", "array2[]" => 'TEXT';
};
my $res2 = $secondLevel->scrape(URI- >new("http://courses.illinois.edu/cis/2011/spring/schedule/CS/$coursenum.html"));
my $num = @{$res2->{array2}};
print $num;
print "---------------------", "\n";
my @curr = @{$res2->{array2}};
foreach my $elem2 (@curr){
$num++;
print $elem2, " ", "\n";
}
print "---------------------", "\n";
}
Any ideas?
Thanks
Upvotes: 0
Views: 1167
Reputation: 9697
I played a bit with your problem. You can get course id, title and link to individual course page within initial scraper:
my $courses = scraper {
process 'div.ws-row',
'course[]' => scraper {
process 'div.ws-course-number', 'id' => 'TEXT';
process 'div.ws-course-title', 'title' => 'TEXT';
process 'div.ws-course-title a', 'link' => '@href';
};
result 'course';
};
The result of scraping is arrayref with hashrefs like this:
{ id => "CS 103",
title => "Introduction to Programming",
link => bless(do{\(my $o = "http://courses.illinois.edu/cis/2011/spring/schedule/CS/103.html?skinId=2169")}, "URI::http"),
},
....
Then you can do additional scraping for each course from their individual pages and add such information into original structure:
for my $course (@$res) {
my $crs_scraper = scraper {
process 'div.ws-description', 'desc' => 'TEXT';
# ... add more items here
};
my $additional_data = $crs_scraper->scrape(URI->new($course->{link}));
# slice assignment to add them into course definition
@{$course}{ keys %$additional_data } = values %$additional_data;
}
Source combined together is as follows:
use strict; use warnings;
use URI;
use Web::Scraper;
use Data::Dump qw(dump);
my $url = 'http://courses.illinois.edu/cis/2011/spring/schedule/CS/index.html?skinId=2169';
my $courses = scraper {
process 'div.ws-row',
'course[]' => scraper {
process 'div.ws-course-number', 'id' => 'TEXT';
process 'div.ws-course-title', 'title' => 'TEXT';
process 'div.ws-course-title a', 'link' => '@href';
};
result 'course';
};
my $res = $courses->scrape(URI->new($url));
for my $course (@$res) {
my $crs_scraper = scraper {
process 'div.ws-description', 'desc' => 'TEXT';
# ... add more items here
};
my $additional_data = $crs_scraper->scrape(URI->new($course->{link}));
# slice assignment to add them into course definition
@{$course}{ keys %$additional_data } = values %$additional_data;
}
dump $res;
Upvotes: 1
Reputation: 8895
The easiest way to go in this case is use
HTML::TableExtract
In case you are looking for data from the table only.
Upvotes: 1
Reputation: 98398
Looks to me like
my $coursenum = substr($elem,2,4)
should be
my $coursenum = substr($elem,3,3)
Upvotes: 1