Reputation: 73
I'm trying to parse pages much like this one, but for many latitude/longitude pairs. The crawler loops through all of the webpages, but doesn't output anything.
Here is my code:
import scrapy
import json
from tutorial.items import DmozItem
from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["proadvisorservice.intuit.com"]

    min_lat = 35
    max_lat = 40
    min_long = -100
    max_long = -90

    def start_requests(self):
        for i in range(self.min_lat, self.max_lat):
            for j in range(self.min_long, self.max_long):
                yield scrapy.Request('http://proadvisorservice.intuit.com/v1/search?latitude=%d&longitude=%d&radius=100&pageNumber=1&pageSize=&sortBy=distance' % (i, j),
                                     meta={'index': (i, j)},
                                     callback=self.parse)

    def parse(self, response):
        jsonresponse = json.loads(response.body_as_unicode())
        for x in jsonresponse['searchResults']:
            item = DmozItem()
            item['firstName'] = x['firstName']
            item['lastName'] = x['lastName']
            item['phoneNumber'] = x['phoneNumber']
            item['email'] = x['email']
            item['companyName'] = x['companyName']
            item['qbo'] = x['qbopapCertVersions']
            item['qbd'] = x['papCertVersions']
            yield item
Upvotes: 0
Views: 697
Reputation: 8614
When using CrawlSpider, you should not override the parse() method:
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work. (source)
But since you are customizing your spider manually, and not using the CrawlSpider functionality anyway, I would suggest that you don't inherit from it. Instead, inherit from scrapy.Spider:
class DmozSpider(scrapy.Spider):
    ...
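As a side note, the request-generation logic in start_requests() can be sanity-checked without running Scrapy at all. A minimal sketch in plain Python, reusing the coordinate bounds and the URL template from the question (note that range() excludes the upper bound, so max_lat=40 and max_long=-90 themselves are never requested):

```python
# Rebuild the URLs that start_requests() would yield, using the
# same bounds as the question: latitudes 35..39, longitudes -100..-91.
min_lat, max_lat = 35, 40
min_long, max_long = -100, -90

url_template = ('http://proadvisorservice.intuit.com/v1/search'
                '?latitude=%d&longitude=%d&radius=100'
                '&pageNumber=1&pageSize=&sortBy=distance')

urls = [url_template % (i, j)
        for i in range(min_lat, max_lat)
        for j in range(min_long, max_long)]

print(len(urls))   # 5 latitudes x 10 longitudes = 50 requests
print(urls[0])     # first URL: latitude=35, longitude=-100
```

If this prints the URLs you expect, the grid loop is fine and the problem is only the parse() override being swallowed by CrawlSpider.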
Upvotes: 1