Reputation: 2806
I use this CrawlSpider example as the 'backbone' for my crawler.
I want to implement this idea:
The first Rule follows the links. The matched links are then passed on to the second Rule, which matches new links according to its pattern and calls the callback on them.
For example, I have these Rules:
...
start_urls = ['http://play.google.com/store']
rules = (
    Rule(SgmlLinkExtractor(allow=('/store/apps',))),
    Rule(SgmlLinkExtractor(allow=('/details\?id=',)), callback='parse_app'),
)
...
How I expect the parser to work:
Open http://play.google.com/store and match the first URL, https://play.google.com/store/apps/category/SHOPPING/collection/topselling_free
Pass the found URL (https://play.google.com/store/apps/category/SHOPPING/collection/topselling_free) to the second Rule
The second Rule tries to match its pattern (allow=('.*/details\?id=',)) and, if it matches, calls the callback parse_app for that URL.
At the moment, the crawler just walks through all the links and doesn't parse anything.
Upvotes: 0
Views: 102
Reputation: 20748
As Xu Jiawan implies, URLs matching /details\?id=
also match /store/apps
(from what I saw briefly).
So try changing the order of the rules so that the parse_app
Rule matches first:
rules = (
    Rule(SgmlLinkExtractor(allow=('/details\?id=',)), callback='parse_app'),
    Rule(SgmlLinkExtractor(allow=('/store/apps',))),
)
Or use deny:
rules = (
    Rule(SgmlLinkExtractor(allow=('/store/apps',), deny=('/details\?id=',))),
    Rule(SgmlLinkExtractor(allow=('/details\?id=',)), callback='parse_app'),
)
If you want the first Rule() to be applied only to 'http://play.google.com/store', and then use the second Rule() to call parse_app,
you may need to implement the parse_start_url method
to generate Requests using SgmlLinkExtractor(allow=('/store/apps',)).
Something like:
from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item


class PlaystoreSpider(CrawlSpider):
    name = 'playstore'
    #allowed_domains = ['example.com']
    start_urls = ['https://play.google.com/store']

    rules = (
        #Rule(SgmlLinkExtractor(allow=('/store/apps',), deny=('/details\?id=',))),
        Rule(SgmlLinkExtractor(allow=('/details\?id=',)), callback='parse_app'),
    )

    def parse_app(self, response):
        self.log('Hi, this is an app page! %s' % response.url)
        # do something

    def parse_start_url(self, response):
        return [Request(url=link.url)
                for link in SgmlLinkExtractor(
                    allow=('/store/apps',), deny=('/details\?id=',)
                ).extract_links(response)]
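As a minimal illustration of what that parse_start_url does, here is the same allow/deny filtering applied to a toy HTML snippet with plain regexes instead of SgmlLinkExtractor (the markup and hrefs are invented for the example):

```python
import re

# Toy HTML standing in for the Play Store landing page (hypothetical markup).
html = '''
<a href="/store/apps/category/SHOPPING/collection/topselling_free">Top free</a>
<a href="/store/apps/details?id=com.example.app">An app</a>
<a href="/about">About</a>
'''

# Mimic allow=('/store/apps',), deny=('/details\?id=',): keep category links
# from the start page, drop detail links so only the parse_app Rule handles
# them on the pages that follow.
links = [href for href in re.findall(r'href="([^"]+)"', html)
         if re.search(r'/store/apps', href)
         and not re.search(r'/details\?id=', href)]
print(links)  # ['/store/apps/category/SHOPPING/collection/topselling_free']
```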
Upvotes: 1