christof

Reputation: 1

Scrapy: only scrape parts of website

I'd like to scrape parts of a number of very large websites using Scrapy. For instance, from northeastern.edu I would like to scrape only pages below the URL http://www.northeastern.edu/financialaid/, such as http://www.northeastern.edu/financialaid/contacts or http://www.northeastern.edu/financialaid/faq. I do not want to scrape the university's entire website, so for example http://www.northeastern.edu/faq should not be allowed.

I have no problem with URLs of the form financialaid.northeastern.edu (simply limiting allowed_domains to financialaid.northeastern.edu is enough), but the same strategy doesn't work for northeastern.edu/financialaid. (The whole spider code is actually longer, since it loops through different web pages; I can provide details. Everything works apart from the rules.)
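For context, here is a minimal sketch of the sub-domain case that does work for me (the spider name, the callback body and the exact sub-domain are just placeholders):

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class SubdomainSpider(CrawlSpider):
    name = 'subdomain'
    # limiting allowed_domains to the sub-domain is enough here:
    # the OffsiteMiddleware drops requests to any other host
    allowed_domains = ['financialaid.northeastern.edu']
    start_urls = ['http://financialaid.northeastern.edu/']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # just record which URL was visited
        self.log(response.url)

The path-based case is where I am stuck. Here is my first attempt: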

import scrapy
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

from test.items import testItem

class DomainSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['northestern.edu/financialaid']
    start_urls = ['http://www.northestern.edu/financialaid/']

    rules = (
        Rule(LxmlLinkExtractor(allow=(r"financialaid/",)), callback='parse_item', follow=True),
        )

    def parse_item(self, response):
        i = testItem()
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

The results look like this:

    2015-05-12 14:10:46-0700 [scrapy] INFO: Scrapy 0.24.4 started (bot: finaid_scraper)
    2015-05-12 14:10:46-0700 [scrapy] INFO: Optional features available: ssl, http11
    2015-05-12 14:10:46-0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'finaid_scraper.spiders', 'SPIDER_MODULES': ['finaid_scraper.spiders'], 'FEED_URI': '/Users/hugo/Box Sync/finaid/ScrapedSiteText_check/Northeastern.json', 'USER_AGENT': 'stanford_sociology', 'BOT_NAME': 'finaid_scraper'}
    2015-05-12 14:10:46-0700 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2015-05-12 14:10:46-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2015-05-12 14:10:46-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2015-05-12 14:10:46-0700 [scrapy] INFO: Enabled item pipelines: 
    2015-05-12 14:10:46-0700 [graphspider] INFO: Spider opened
    2015-05-12 14:10:46-0700 [graphspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-05-12 14:10:46-0700 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2015-05-12 14:10:46-0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
    2015-05-12 14:10:46-0700 [graphspider] DEBUG: Redirecting (301) to <GET http://www.northeastern.edu/financialaid/> from <GET http://www.northeastern.edu/financialaid>
    2015-05-12 14:10:47-0700 [graphspider] DEBUG: Crawled (200) <GET http://www.northeastern.edu/financialaid/> (referer: None)
    2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'assistive.usablenet.com': <GET http://assistive.usablenet.com/tt/http://www.northeastern.edu/financialaid/index.html>
    2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'www.northeastern.edu': <GET http://www.northeastern.edu/financialaid/index.html>
    2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'www.facebook.com': <GET http://www.facebook.com/pages/Boston-MA/NU-Student-Financial-Services/113143082891>
    2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'twitter.com': <GET https://twitter.com/NUSFS>
    2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'nusfs.wordpress.com': <GET http://nusfs.wordpress.com/>
    2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'northeastern.edu': <GET http://northeastern.edu/howto>
    2015-05-12 14:10:47-0700 [graphspider] INFO: Closing spider (finished)
    2015-05-12 14:10:47-0700 [graphspider] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 431,
         'downloader/request_count': 2,
         'downloader/request_method_count/GET': 2,
         'downloader/response_bytes': 9574,
         'downloader/response_count': 2,
         'downloader/response_status_count/200': 1,
         'downloader/response_status_count/301': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 5, 12, 21, 10, 47, 94112),
         'log_count/DEBUG': 10,
         'log_count/INFO': 7,
         'offsite/domains': 6,
         'offsite/filtered': 32,
         'request_depth_max': 1,
         'response_received_count': 1,
         'scheduler/dequeued': 2,
         'scheduler/dequeued/memory': 2,
         'scheduler/enqueued': 2,
         'scheduler/enqueued/memory': 2,
         'start_time': datetime.datetime(2015, 5, 12, 21, 10, 46, 566538)}
    2015-05-12 14:10:47-0700 [graphspider] INFO: Spider closed (finished)

The second strategy I attempted was to use the allow rules of the LxmlLinkExtractor and to limit allowed_domains to the whole domain, but in that case the entire website is scraped. (Deny rules do work.)

import scrapy
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

from test.items import testItem

class DomainSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['www.northestern.edu']
    start_urls = ['http://www.northestern.edu/financialaid/']
    rules = (
        Rule(LxmlLinkExtractor(allow=(r"financialaid/",)), callback='parse_item', follow=True),
        )

    def parse_item(self, response):
        i = testItem()
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

I also tried:

rules = (
    Rule(LxmlLinkExtractor(allow=(r"northeastern.edu/financialaid",)), callback='parse_site', follow=True),
)

The log is too long to post here, but these lines show that Scrapy ignores the allow rule:

    2015-05-12 14:26:06-0700 [graphspider] DEBUG: Crawled (200) <GET http://www.northeastern.edu/camd/journalism/2014/10/07/prof-leff-talks-american-press-holocaust/> (referer: http://www.northeastern.edu/camd/journalism/2014/10/07/prof-schroeder-quoted-nc-u-s-senate-debates-charlotte-observer/)
    2015-05-12 14:26:06-0700 [graphspider] DEBUG: Crawled (200) <GET http://www.northeastern.edu/camd/journalism/tag/north-carolina/> (referer: http://www.northeastern.edu/camd/journalism/2014/10/07/prof-schroeder-quoted-nc-u-s-senate-debates-charlotte-observer/)
    2015-05-12 14:26:06-0700 [graphspider] DEBUG: Scraped from <200 http://www.northeastern.edu/camd/journalism/2014/10/07/prof-leff-talks-american-press-holocaust/>

Here is my items.py:

from scrapy.item import Item, Field

class FinAidScraperItem(Item):
    # define the fields for your item here like:
    url = Field()
    linkedurls = Field()
    internal_linkedurls = Field()
    external_linkedurls = Field()
    http_status = Field()
    title = Field()
    text = Field()

I am using a Mac with Python 2.7 and Scrapy 0.24.4. Similar questions have been posted before, but none of the suggested solutions fixed my problem.

Upvotes: 0

Views: 1320

Answers (1)

alecxe

Reputation: 473803

You have a typo in the URLs used inside your spiders; compare:

northeastern

vs

northestern

Here is the spider that worked for me (it follows "financialaid" links only):

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class DomainSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['northeastern.edu']
    start_urls = ['http://www.northeastern.edu/financialaid/']

    rules = (
        Rule(LinkExtractor(allow=r"financialaid/"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print response.url

Note that I'm using the LinkExtractor shortcut and a plain string for the allow argument value.
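If you want to be stricter later on and only follow links whose path actually starts with /financialaid/ (rather than any URL that merely contains that substring), you could anchor the pattern, since the allow regexes are matched against the absolute URL. A minimal sketch, assuming all the pages live under the www host:

rules = (
    # re.search() is applied to the full URL, so anchoring at the start
    # and escaping the dots keeps the crawl under /financialaid/ only
    Rule(LinkExtractor(allow=r"^https?://www\.northeastern\.edu/financialaid/"),
         callback='parse_item', follow=True),
)

Either way, running scrapy crawl domain should now print only financialaid URLs.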

I've also edited your question and fixed the indentation problems, assuming they were just posting issues.

Upvotes: 2
