Dmitrijs Zubriks

Reputation: 2806

How to make two CrawlSpider Rules cooperate

I use this CrawlSpider example as the backbone for my crawler.

I want to implement this idea:

The first Rule follows the links. The matched pages are then passed on to the second Rule, which matches new links against its pattern and calls its callback on them.

For example, I have these Rules:

...

start_urls = ['http://play.google.com/store']

rules = (
    Rule(SgmlLinkExtractor(allow=('/store/apps',))),
    Rule(SgmlLinkExtractor(allow=('/details\?id=',)), callback='parse_app'),
)

...

How I expect the parser to work:

  1. Open http://play.google.com/store and match the first URL 'https://play.google.com/store/apps/category/SHOPPING/collection/topselling_free'

  2. Pass the found URL ('https://play.google.com/store/apps/category/SHOPPING/collection/topselling_free') to the second Rule

  3. The second Rule tries to match its pattern (allow=('.*/details\?id=',)) and, if it matches, calls the callback 'parse_app' for that URL.

At the moment, the crawler just walks through all the links and doesn't parse anything.

Upvotes: 0

Views: 102

Answers (1)

paul trmbrth

Reputation: 20748

As Xu Jiawan implies, URLs matching /details\?id= also match /store/apps (from what I saw briefly).
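The overlap is easy to verify: Scrapy's link extractors apply allow patterns as unanchored regex searches against the full URL, so a quick check with Python's re module (the app id below is made up for illustration) shows that a detail link satisfies both rules' patterns:

```python
import re

# The two allow patterns from the question's rules.
store_apps = re.compile(r'/store/apps')
details = re.compile(r'/details\?id=')

# An illustrative app-detail URL (the id is invented).
url = 'https://play.google.com/store/apps/details?id=com.example.app'

# The detail URL also contains '/store/apps', so BOTH patterns match
# under an unanchored search.
print(bool(store_apps.search(url)))  # True
print(bool(details.search(url)))     # True
```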

CrawlSpider hands each extracted link to the first Rule whose extractor matches it, so try changing the order of the rules to have the parse_app Rule match first:

rules = (
    Rule(SgmlLinkExtractor(allow=('/details\?id=',)), callback='parse_app'),
    Rule(SgmlLinkExtractor(allow=('/store/apps',))),
)
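A minimal pure-Python sketch of that first-match-wins dispatch (a simplified model of how CrawlSpider assigns links to rules, using the patterns from the question):

```python
import re

# (pattern, callback name) pairs in rule order; None means "follow only".
RULES = [
    (re.compile(r'/details\?id='), 'parse_app'),
    (re.compile(r'/store/apps'), None),
]

def dispatch(url):
    """Return the callback of the first rule whose pattern matches url."""
    for pattern, callback in RULES:
        if pattern.search(url):
            return callback
    return None  # no rule matched: the link is not followed

# With parse_app's rule first, detail pages now reach the callback...
print(dispatch('https://play.google.com/store/apps/details?id=com.example'))  # parse_app
# ...while category pages are still followed without a callback.
print(dispatch('https://play.google.com/store/apps/category/SHOPPING'))  # None
```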

Or use deny:

rules = (
    Rule(SgmlLinkExtractor(allow=('/store/apps',), deny=('/details\?id=',))),
    Rule(SgmlLinkExtractor(allow=('/details\?id=',)), callback='parse_app'),
)

If you want the first Rule() to be applied only to 'http://play.google.com/store', and then use the second Rule() to call parse_app, you may need to implement the parse_start_url method to generate Requests using SgmlLinkExtractor(allow=('/store/apps',)).

Something like:

from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class PlaystoreSpider(CrawlSpider):
    name = 'playstore'
    #allowed_domains = ['example.com']
    start_urls = ['https://play.google.com/store']

    rules = (
        #Rule(SgmlLinkExtractor(allow=('/store/apps',), deny=('/details\?id=',))),
        Rule(SgmlLinkExtractor(allow=('/details\?id=',)), callback='parse_app'),
    )

    def parse_app(self, response):
        self.log('Hi, this is an app page! %s' % response.url)
        # do something


    def parse_start_url(self, response):
        return [Request(url=link.url)
                for link in SgmlLinkExtractor(
                    allow=('/store/apps',), deny=('/details\?id=',)
                ).extract_links(response)]

Upvotes: 1
