Reputation: 1250
I'm trying to scrape data from this page - http://catalog.umassd.edu/content.php?catoid=45&navoid=3554
I want to expand every section with the 'Display courses for this department' link and then get the course information (text) for each course on that page.
I've written the following script:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from courses.items import Course

class EduSpider(CrawlSpider):
    name = 'umassd.edu'
    allowed_domains = ['umassd.edu']
    start_urls = ['http://catalog.umassd.edu/content.php']

    rules = (Rule(LxmlLinkExtractor(
        allow=('.*/http://catalog.umassd.edu/preview_course.php?catoid=[0-9][0-9]&coid=[0-9][0-9][0-9][0-9][0-9][0-9]', ),
    ), callback='parse_item'),)

    def parse_item(self, response):
        item = Course()
        print(response)
Now, no matter what start_urls I give, the spider never seems to reach the preview_course.php links - I've tried a few variations. The script exits without crawling any /content.php pages at all.
This is for educational purposes only.
Upvotes: 3
Views: 1519
Reputation: 21446
The URLs you are looking for are retrieved via AJAX requests. If you open up your browser's dev tools and go to the "Network" tab, you'll see a request being made when you click the button, to something like:

http://catalog.umassd.edu/ajax/preview_filter_show_hide_data.php?show_hide=show&cat_oid=45&nav_oid=3554&ent_oid=2027&type=c&link_text=this department

This URL is generated by JavaScript, and its content is then downloaded and injected into the page.
Since Scrapy does not execute any JavaScript, you need to recreate this URL yourself. Fortunately, it's very easy to reverse engineer in your case.
If you inspect the HTML source, you can see that the "Display courses for this department" link node has some interesting stuff on it:
<a href="#"
   target="_blank"
   onclick="showHideFilterData(this, 'show', '45', '3554', '2027', 'c', 'this department'); return false;">
   Display courses for this department.</a>
When the link is clicked, this JavaScript function runs, and if you compare its arguments to the URL above you can clearly see the similarities.
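To see the extraction in isolation, here is a standalone sketch using only Python's stdlib: it pulls the quoted arguments out of the onclick value shown above with the same regular expression the spider below uses, then drops them into the AJAX URL template:

```python
import re

# onclick value copied from the link's HTML above
onclick = ("showHideFilterData(this, 'show', '45', '3554', '2027', "
           "'c', 'this department'); return false;")

# pull out every single-quoted argument
args = re.findall(r"'(.+?)'", onclick)
print(args)  # ['show', '45', '3554', '2027', 'c', 'this department']

# drop the arguments into the AJAX url template
url = ('http://catalog.umassd.edu/ajax/preview_filter_show_hide_data.php'
       '?show_hide={}&cat_oid={}&nav_oid={}&ent_oid={}&type={}&link_text={}'
       .format(*args))
print(url)
```

Note the last argument contains a space; browsers will percent-encode it for you, but if you build the request yourself you may want to pass it through urllib.parse.quote.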
Now we can recreate this URL using that data:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://catalog.umassd.edu/content.php?catoid=45&navoid=3554']

    def parse(self, response):
        # get the "onclick" javascript call of every "show more" link
        # and extract the arguments supplied to it with a regular expression
        links = response.xpath("//a/@onclick[contains(.,'showHide')]")
        for link in links:
            args = link.re("'(.+?)'")
            # build the AJAX url by putting the arguments from the page
            # source into the url template
            url = ('http://catalog.umassd.edu/ajax/preview_filter_show_hide_data.php'
                   '?show_hide={}&cat_oid={}&nav_oid={}&ent_oid={}&type={}&link_text={}'
                   .format(*args))
            yield scrapy.Request(url, self.parse_more)

    def parse_more(self, response):
        # here you'll get the page source with all of the course links
        pass
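The response in parse_more is the HTML fragment the AJAX endpoint injects into the page, containing the preview_course.php links for one department. A minimal stdlib-only sketch of pulling those links out of such a fragment (the sample markup here is an assumption about the fragment's shape, not copied from the site):

```python
from html.parser import HTMLParser

class CourseLinkParser(HTMLParser):
    """Collect hrefs that point at preview_course.php pages."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href', '')
            if 'preview_course.php' in href:
                self.links.append(href)

# hypothetical fragment, shaped like the catalog's course listings
fragment = '''
<td><a href="preview_course.php?catoid=45&amp;coid=123456">ABC 101 - Example Course</a></td>
<td><a href="preview_course.php?catoid=45&amp;coid=123457">ABC 102 - Another Course</a></td>
'''

parser = CourseLinkParser()
parser.feed(fragment)
print(parser.links)
```

Inside the spider you'd do the same with response.xpath and yield scrapy.Request(response.urljoin(href), ...) for each course link, then scrape the course text from those pages.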
Upvotes: 3