boltthrower

Reputation: 1250

Scrapy spider can't find URLs that load on click

I'm trying to scrape data from this page - http://catalog.umassd.edu/content.php?catoid=45&navoid=3554

I want to expand every section with the 'Display courses for this department' link and then get the course information (text) for each course on that page.

I've written the following script:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

from courses.items import Course


class EduSpider(CrawlSpider):
    name = 'umassd.edu'
    allowed_domains = ['umassd.edu']
    start_urls = ['http://catalog.umassd.edu/content.php']

    rules = (
        Rule(LxmlLinkExtractor(
            allow=(r'http://catalog\.umassd\.edu/preview_course\.php\?'
                   r'catoid=[0-9][0-9]&coid=[0-9][0-9][0-9][0-9][0-9][0-9]', ),
        ), callback='parse_item'),
    )

    def parse_item(self, response):
        item = Course()
        print(response)

No matter what start_url I give, the spider never seems to reach the preview_course.php links (I tried a few variations of the allow pattern). The script exits without crawling any content.php pages at all.

This is for educational purposes only.

Upvotes: 3

Views: 1519

Answers (1)

Granitosaurus

Reputation: 21446

The urls you are looking for are retrieved via AJAX requests. If you open up your browser's dev tools and go to the "Network" tab, you'll see a request being made when you click the button, to something like:

http://catalog.umassd.edu/ajax/preview_filter_show_hide_data.php?show_hide=show&cat_oid=45&nav_oid=3554&ent_oid=2027&type=c&link_text=this%20department

This url is generated by javascript, and its content is then downloaded and injected into the page.
Since scrapy does not execute any javascript, you need to recreate this url yourself. Fortunately, it's very easy to reverse engineer in this case.
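As a sanity check before wiring this into a spider, you can rebuild that url by hand with the standard library. This is just a sketch using the parameter values visible in the request above; `quote_via=quote` makes the space encode as `%20`, matching what the browser sends:

```python
from urllib.parse import urlencode, quote

# arguments observed in the dev-tools request for this catalog page
params = {
    'show_hide': 'show',
    'cat_oid': '45',
    'nav_oid': '3554',
    'ent_oid': '2027',
    'type': 'c',
    'link_text': 'this department',
}
base = 'http://catalog.umassd.edu/ajax/preview_filter_show_hide_data.php'
# quote_via=quote encodes the space in link_text as %20 instead of +
url = base + '?' + urlencode(params, quote_via=quote)
print(url)
```

Fetching that url directly (in a browser or with scrapy shell) should return the same HTML fragment the page injects on click.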

If you inspect the html source, you can see that the "display courses for this department" link node has some interesting stuff on it:

<a href="#"
target="_blank"
onclick="showHideFilterData(this, 'show', '45', '3554', '2027', 'c', 'this department'); return false;">
Display courses for this department.</a>

We can see that a javascript function runs when the link is clicked, and if we compare its arguments to the url above, the similarities are obvious.
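The argument extraction can be checked in isolation with plain `re`, outside scrapy; the non-greedy pattern pulls out everything between single quotes:

```python
import re

# the onclick value copied from the page source
onclick = ("showHideFilterData(this, 'show', '45', '3554', '2027', 'c', "
           "'this department'); return false;")
# non-greedy match of everything between single quotes
args = re.findall(r"'(.+?)'", onclick)
print(args)  # ['show', '45', '3554', '2027', 'c', 'this department']
```

The six captured values line up one-to-one with the six query parameters of the AJAX url.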

Now we can recreate this url using this data:

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://catalog.umassd.edu/content.php?catoid=45&navoid=3554']

    def parse(self, response):
        # get the "onclick" javascript call of every "show more" link
        # and extract the arguments supplied to it with a regular expression
        links = response.xpath("//a/@onclick[contains(.,'showHide')]")
        for link in links:
            args = link.re("'(.+?)'")
            # build the AJAX url by substituting the arguments from the
            # page source into a url template
            url = ('http://catalog.umassd.edu/ajax/preview_filter_show_hide_data.php'
                   '?show_hide={}&cat_oid={}&nav_oid={}&ent_oid={}&type={}'
                   '&link_text={}').format(*args)
            yield scrapy.Request(url, self.parse_more)

    def parse_more(self, response):
        # here you'll get the page source with all of the course links
        pass
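Inside parse_more you can then pull the injected course links out of the AJAX response. A minimal sketch with plain `re`, assuming the injected fragment contains ordinary anchors to preview_course.php (the snippet and coid values below are hypothetical, for illustration only):

```python
import re

# hypothetical fragment of the AJAX response; the real markup may differ,
# but the injected HTML contains plain anchors to the course preview pages
snippet = '''
<a href="preview_course.php?catoid=45&coid=123456">CIS 115 - Intro</a>
<a href="preview_course.php?catoid=45&coid=123457">CIS 180 - Data</a>
'''
course_links = re.findall(r'href="(preview_course\.php[^"]*)"', snippet)
print(course_links)
```

In the spider itself you would do the equivalent with `response.xpath('//a/@href')` and yield a `scrapy.Request` (via `response.urljoin`) for each course page.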

Upvotes: 3
