Reputation: 1
I'm trying to use a crawl spider to scrape some real estate data, but it keeps giving me this error:
Traceback (most recent call last):
  File "//anaconda/lib/python2.7/site-packages/twisted/internet/defer.py", line 1301, in _inlineCallbacks
    result = g.send(result)
  File "//anaconda/lib/python2.7/site-packages/scrapy/crawler.py", line 90, in crawl
    six.reraise(*exc_info)
  File "//anaconda/lib/python2.7/site-packages/scrapy/crawler.py", line 71, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "//anaconda/lib/python2.7/site-packages/scrapy/crawler.py", line 94, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "//anaconda/lib/python2.7/site-packages/scrapy/spiders/crawl.py", line 96, in from_crawler
    spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
  File "//anaconda/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 50, in from_crawler
    spider = cls(*args, **kwargs)
TypeError: __init__() takes exactly 3 arguments (1 given)
Here is the code defining the crawler:
class RealestateSpider(scrapy.spiders.CrawlSpider):
    ### Real estate web crawler
    name = 'buyrentsold'
    allowed_domains = ['realestate.com.au']

    def __init__(self, command, search):
        search = re.sub(r'\s+', '+', re.sub(',+', '%2c', search)).lower()
        url = '/{0}/in-{{0}}{{{{0}}}}/list-{{{{1}}}}'.format(command)
        start_url = 'http://www.{0}{1}'
        start_url = start_url.format(
            self.allowed_domains[0], url.format(search)
        )
        self.start_urls = [start_url.format('', 1)]
        extractor = scrapy.linkextractors.sgml.SgmlLinkExtractor(
            allow=url.format(re.escape(search)).format('.*', '')
        )
        rule = scrapy.spiders.Rule(
            extractor, callback='parse_items', follow=True
        )
        self.rules = [rule]
        super(RealestateSpider, self).__init__()
    def parse_items(self, response):
        ### Parse a page of real estate listings
        hxs = scrapy.selector.HtmlXPathSelector(response)
        for i in hxs.select('//div[contains(@class, "listingInfo")]'):
            item = RealestateItem()
            path = 'div[contains(@class, "propertyStats")]//text()'
            item['price'] = i.select(path).extract()
            vcard = i.select('div[contains(@class, "vcard")]//a')
            item['address'] = vcard.select('text()').extract()
            url = vcard.select('@href').extract()
            if len(url) == 1:
                item['url'] = 'http://www.{0}{1}'.format(
                    self.allowed_domains[0], url[0]
                )
            features = i.select('dl')
            for field in ('bed', 'bath', 'car'):
                path = '(@class, "rui-icon-{0}")'.format(field)
                path = 'dt[contains{0}]'.format(path)
                path = '{0}/following-sibling::dd[1]'.format(path)
                path = '{0}/text()'.format(path)
                item[field] = features.select(path).extract() or 0
            yield item
Here is where the error comes up:
crawler = scrapy.crawler.CrawlerProcess(scrapy.conf.settings)
sp=RealestateSpider(command, search)
crawler.crawl(sp)
crawler.start()
Can anyone help me with this problem? Thanks!
Upvotes: 0
Views: 1042
Reputation: 11
I ran into this exact problem, and the above solution was too difficult for me, so I sidestepped the issue by passing the arguments as class attributes instead:
process = CrawlerProcess()
sp = MySpider
sp.storage = reference_to_datastorage
process.crawl(sp)
process.start()
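For completeness, here is a minimal sketch of how the spider might read that attribute (the storage name and the parse body here are illustrative assumptions, not from the question):
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']
    storage = None  # assigned from outside, before process.crawl() runs

    def parse(self, response):
        # a class attribute is reachable through self like any other attribute
        if self.storage is not None:
            self.storage.append(response.url)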
Hopefully this is a workable solution to your issue as well.
Upvotes: 1
Reputation: 21406
The crawler.crawl() method requires a spider class as an argument, whereas your code provides a spider object (an instance).
There are several ways of doing this right, but the most straight-forward way would be simply to extend the spider class:
class MySpider(Spider):
    command = None
    search = None

    def __init__(self):
        # do something with self.command and self.search
        super(MySpider, self).__init__()
And then:
crawler = scrapy.crawler.CrawlerProcess(scrapy.conf.settings)

class MySpider(RealestateSpider):
    command = 'foo'
    search = 'bar'

crawler.crawl(MySpider)
crawler.start()
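Alternatively, note that your traceback shows crawl() forwarding its extra arguments all the way down to the spider constructor (spider = cls(*args, **kwargs)), so you can likely keep the original __init__(self, command, search) signature and pass the arguments through crawl() itself:
crawler = scrapy.crawler.CrawlerProcess(scrapy.conf.settings)
crawler.crawl(RealestateSpider, command, search)  # forwarded to __init__
crawler.start()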
Upvotes: 2