Reputation: 371
I'm currently trying to work with the Scrapy framework to simply collect a bunch of URLs that I can store and sort later. However, I can't seem to get the URLs to print or be written to a file in the callback, no matter what I've tried or adapted from other tutorials. Here's what I currently have for my spider class in this particular example, using a small site:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from crawler.items import CrawlerItem
from scrapy import log

class CrawlerSpider(CrawlSpider):
    name = 'crawler'
    allowed_domains = ["glauberkotaki.com"]
    start_urls = ["http://www.glauberkotaki.com"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), deny=('about.html'))),
        Rule(SgmlLinkExtractor(allow=('about.html')), callback='parseLink', follow="yes"),
    )

    def parseLink(self, response):
        x = HtmlXPathSelector(response)
        print(response.url)
        print("\n")
It crawls all of the pages of this site fine, but it doesn't print anything at all, even when it reaches "www.glauberkotaki.com/about.html", which is the page I was trying to test the code with. It seems to me the callback is never being called.
Upvotes: 0
Views: 93
Reputation: 55932
I don't think your second rule is ever being applied. From the docs:
If multiple rules match the same link, the first one will be used, according to the order they’re defined in this attribute.
Because the first rule matches about.html, the second rule's callback is never fired.
I believe adding the callback to the first rule will work:
    Rule(SgmlLinkExtractor(allow=(), deny=('about.html')), callback='parseLink'),
or, if you just want to test the callback against the about page, remove the first rule entirely.
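As a minimal sketch of that second option, assuming the end goal is just to collect the matched URLs somewhere (the urls.txt file and the append-to-file approach are only an illustration, not part of the original code):

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class CrawlerSpider(CrawlSpider):
        name = 'crawler'
        allowed_domains = ["glauberkotaki.com"]
        start_urls = ["http://www.glauberkotaki.com"]

        # Single rule: every extracted link matching about.html triggers the callback,
        # so there is no earlier rule that can shadow it.
        rules = (
            Rule(SgmlLinkExtractor(allow=('about.html',)), callback='parseLink', follow=True),
        )

        def parseLink(self, response):
            # Called once per response matched by the rule above.
            print(response.url)
            with open('urls.txt', 'a') as f:
                f.write(response.url + '\n')

Note that follow is set explicitly here: when a Rule has a callback, CrawlSpider stops following links from those responses by default, so follow=True keeps the crawl going.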
Upvotes: 1