pnv

Reputation: 1499

Python Scrapy: not able to crawl

I am trying to crawl some websites using Scrapy. Sample code is below. The parse method is never called. I run the spider through a Twisted reactor (code provided), started from startCrawling.py. I know I am missing something. Could you please help?

Thanks,

Code: categorization.py

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.selector import Selector
from scrapy.selector import HtmlXPathSelector
from items.items import CategorizationItem
from scrapy.contrib.spiders.crawl import CrawlSpider
class TestingSpider(CrawlSpider):
    print 'in spider'
    name = 'testSpider'
    allowed_domains = ['wikipedia.org']
    start_urls = ['http://www.wikipedia.org']

    def parse(self, response):
        # Scrape data from page
        print 'here'
        open('test.html','wb').write(response.body)

Code: startCrawling.py

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.utils.project import get_project_settings

from spiders.categorization import TestingSpider

# Scrapy spiders script...

def stop_reactor():
    reactor.stop #@UndefinedVariable
    print 'hi'

    dispatcher.connect(stop_reactor, signal=signals.spider_closed)
    spider = TestingSpider()
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    reactor.run()#@UndefinedVariable

Upvotes: 0

Views: 197

Answers (1)

bosnjak

Reputation: 8624

When using CrawlSpider, you are not supposed to override the parse() method. Instead, give the callback a different name and set it in your Rule.
Here is the relevant excerpt from the official documentation:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
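
To make the mechanism concrete, here is a minimal sketch in plain Python 3. It is not Scrapy's source code: MiniRule and MiniCrawlSpider are hypothetical stand-ins that only mimic the idea from the excerpt, namely that the base class's parse() is what forwards responses to the rule callbacks. The callback name parse_item and the substring matching are assumptions made for illustration.

```python
# Simplified stand-in for the dispatch described in the excerpt above.
# NOT Scrapy's real implementation; every class here is hypothetical.


class MiniRule:
    """Hypothetical rule: pairs a URL substring with a callback name."""

    def __init__(self, url_contains, callback):
        self.url_contains = url_contains
        self.callback = callback


class MiniCrawlSpider:
    """Hypothetical base class whose parse() applies the rules."""

    rules = ()

    def parse(self, url):
        # The base class owns parse(): it looks up matching rules and
        # forwards the "response" (just a URL here) to each callback.
        results = []
        for rule in self.rules:
            if rule.url_contains in url:
                results.append(getattr(self, rule.callback)(url))
        return results


class BrokenSpider(MiniCrawlSpider):
    rules = (MiniRule("wiki", callback="parse_item"),)

    def parse(self, url):
        # Overriding parse() replaces the rule dispatch entirely,
        # so parse_item is never reached.
        return []

    def parse_item(self, url):
        return "scraped " + url


class WorkingSpider(MiniCrawlSpider):
    rules = (MiniRule("wiki", callback="parse_item"),)

    # parse() is left alone; the callback simply has a different name.
    def parse_item(self, url):
        return "scraped " + url


broken_result = BrokenSpider().parse("http://www.wikipedia.org")
working_result = WorkingSpider().parse("http://www.wikipedia.org")
print(broken_result)   # []
print(working_result)  # ['scraped http://www.wikipedia.org']
```

In a real spider the same idea applies: leave parse() untouched, rename your method (for example to parse_item), and pass that name as the callback of a Rule in the spider's rules.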

Upvotes: 2
