boltthrower

Reputation: 1250

Scrapy spider can't find URLs that load on click

I'm trying to scrape data from this page - http://catalog.umassd.edu/content.php?catoid=45&navoid=3554

I want to expand every section with the 'Display courses for this department' link and then get the course information (text) for each course on that page.

I've written the following script:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

from courses.items import Course


class EduSpider(CrawlSpider):
    name = 'umassd.edu'
    allowed_domains = ['umassd.edu']
    start_urls = ['http://catalog.umassd.edu/content.php']

    rules = (
        Rule(LxmlLinkExtractor(
            allow=(r'http://catalog\.umassd\.edu/preview_course\.php\?'
                   r'catoid=[0-9][0-9]&coid=[0-9][0-9][0-9][0-9][0-9][0-9]', ),
        ), callback='parse_item'),
    )

    def parse_item(self, response):
        item = Course()
        print(response)

No matter what start_url I give, the spider never seems to reach the preview_course.php links (I tried a few variations of the allow pattern). The script exits without crawling any content.php pages at all.

This is for educational purposes only.

Upvotes: 3

Views: 1519

Answers (1)

Granitosaurus

Reputation: 21446

The urls you are looking for are retrieved via AJAX requests. If you open up your browser's dev tools and go to the "Network" tab, you'll see a request being made when you click the button, to something like:

http://catalog.umassd.edu/ajax/preview_filter_show_hide_data.php?show_hide=show&cat_oid=45&nav_oid=3554&ent_oid=2027&type=c&link_text=this%20department

This url is generated by javascript, and its content is then downloaded and injected into the page.
Since scrapy does not execute any javascript, you need to recreate this url yourself. Fortunately, it's very easy to reverse engineer in this case.
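As a sanity check before wiring this into a spider, you can rebuild that url by hand with the standard library. This is just a sketch using the parameter values visible in the request above; `quote_via=quote` makes the space encode as `%20`, matching what the browser sends:

```python
from urllib.parse import urlencode, quote

# arguments observed in the dev-tools request for this catalog page
params = {
    'show_hide': 'show',
    'cat_oid': '45',
    'nav_oid': '3554',
    'ent_oid': '2027',
    'type': 'c',
    'link_text': 'this department',
}
base = 'http://catalog.umassd.edu/ajax/preview_filter_show_hide_data.php'
# quote_via=quote encodes the space in link_text as %20 instead of +
url = base + '?' + urlencode(params, quote_via=quote)
print(url)
```

Fetching that url directly (in a browser or with scrapy shell) should return the same HTML fragment the page injects on click.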

If you inspect the html source, you can see that the "display courses for this department" link node has some interesting stuff on it:

<a href="#"
target="_blank"
onclick="showHideFilterData(this, 'show', '45', '3554', '2027', 'c', 'this department'); return false;">
Display courses for this department.</a>

We can see that a javascript function runs when the link is clicked, and if we compare its arguments to the url above, the similarities are obvious.
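The argument extraction can be checked in isolation with plain `re`, outside scrapy; the non-greedy pattern pulls out everything between single quotes:

```python
import re

# the onclick value copied from the page source
onclick = ("showHideFilterData(this, 'show', '45', '3554', '2027', 'c', "
           "'this department'); return false;")
# non-greedy match of everything between single quotes
args = re.findall(r"'(.+?)'", onclick)
print(args)  # ['show', '45', '3554', '2027', 'c', 'this department']
```

The six captured values line up one-to-one with the six query parameters of the AJAX url.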

Now we can recreate this url using this data:

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://catalog.umassd.edu/content.php?catoid=45&navoid=3554']

    def parse(self, response):
        # get the "onclick" javascript call of every "show more" link
        # and extract the arguments supplied to it with a regular expression
        links = response.xpath("//a/@onclick[contains(.,'showHide')]")
        for link in links:
            args = link.re("'(.+?)'")
            # build the AJAX url by substituting the arguments from the
            # page source into a url template
            url = ('http://catalog.umassd.edu/ajax/preview_filter_show_hide_data.php'
                   '?show_hide={}&cat_oid={}&nav_oid={}&ent_oid={}&type={}'
                   '&link_text={}').format(*args)
            yield scrapy.Request(url, self.parse_more)

    def parse_more(self, response):
        # here you'll get the page source with all of the course links
        pass
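Inside parse_more you can then pull the injected course links out of the AJAX response. A minimal sketch with plain `re`, assuming the injected fragment contains ordinary anchors to preview_course.php (the snippet and coid values below are hypothetical, for illustration only):

```python
import re

# hypothetical fragment of the AJAX response; the real markup may differ,
# but the injected HTML contains plain anchors to the course preview pages
snippet = '''
<a href="preview_course.php?catoid=45&coid=123456">CIS 115 - Intro</a>
<a href="preview_course.php?catoid=45&coid=123457">CIS 180 - Data</a>
'''
course_links = re.findall(r'href="(preview_course\.php[^"]*)"', snippet)
print(course_links)
```

In the spider itself you would do the equivalent with `response.xpath('//a/@href')` and yield a `scrapy.Request` (via `response.urljoin`) for each course page.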

Upvotes: 3
