Reputation: 1499
I am trying to crawl some websites using Scrapy. Sample code is below. The parse method is not getting called. I run the spider through a Twisted reactor (code provided), which is started from startCrawling.py. I know that I am missing something. Could you please help?
Thanks,
Code - categorization.py

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.selector import Selector
from scrapy.selector import HtmlXPathSelector
from items.items import CategorizationItem
from scrapy.contrib.spiders.crawl import CrawlSpider

class TestingSpider(CrawlSpider):
    print 'in spider'
    name = 'testSpider'
    allowed_domains = ['wikipedia.org']
    start_urls = ['http://www.wikipedia.org']

    def parse(self, response):
        # Scrape data from page
        print 'here'
        open('test.html','wb').write(response.body)
Code - startCrawling.py

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.utils.project import get_project_settings
from spiders.categorization import TestingSpider

# Scrapy spiders script...
def stop_reactor():
    reactor.stop #@UndefinedVariable
    print 'hi'

dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = TestingSpider()
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
reactor.run()#@UndefinedVariable
Upvotes: 0
Views: 197
Reputation: 8624
You are not supposed to override the parse() method when using the CrawlSpider. You should set a custom callback in your Rule with a different name.
Here is the excerpt from the official documentation:
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
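To make the reason concrete, here is a minimal toy model of the mechanism described in that excerpt. It is not Scrapy's actual code: ToyRule, ToyCrawlSpider, BrokenSpider, WorkingSpider and the dict-shaped response are all hypothetical stand-ins, chosen only to show that the base class's parse() is the place where rules get applied, so overriding it silently disables them.

```python
# Toy model only: these classes are hypothetical stand-ins, not Scrapy's real code.

class ToyRule:
    """Pairs a link filter with the name of the callback method to run."""

    def __init__(self, allow, callback):
        self.allow = allow        # substring a link must contain to match
        self.callback = callback  # method name, looked up on the spider


class ToyCrawlSpider:
    """Mimics the idea that the base class's parse() is what applies the rules."""

    rules = ()

    def parse(self, response):
        results = []
        for rule in self.rules:
            for link in response["links"]:
                if rule.allow in link:
                    # Dispatch to the callback named in the rule.
                    results.append(getattr(self, rule.callback)(link))
        return results


class BrokenSpider(ToyCrawlSpider):
    rules = (ToyRule("/wiki/", "parse_item"),)

    def parse(self, response):
        # Overriding parse() replaces the rule-handling logic entirely.
        return ["custom parse ran, rules ignored"]

    def parse_item(self, link):
        return "item from " + link


class WorkingSpider(ToyCrawlSpider):
    rules = (ToyRule("/wiki/", "parse_item"),)

    # parse() is left alone; the scraping logic lives in a differently named callback.
    def parse_item(self, link):
        return "item from " + link


response = {"links": ["/wiki/Python", "/about", "/wiki/Scrapy"]}
broken_result = BrokenSpider().parse(response)
working_result = WorkingSpider().parse(response)
print(broken_result)
print(working_result)
```

In a real spider the equivalent change would be to rename the overridden parse method to something like parse_item (an illustrative name) and reference it from a Rule via its callback argument, leaving CrawlSpider's own parse untouched.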
Upvotes: 2