Reputation: 36277
I'm getting started with scrapy. My items.py contains:
class ParkerItem(scrapy.Item):
    account = scrapy.Field()
    m = scrapy.Field()
I then generate requests for the website with:
for i in range(max_id):
    yield Request('first_url', method="POST", headers=headers, body=payload, callback=self.parse_get_account)
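For context, such a loop typically lives in the spider's start_requests() method; a trimmed sketch, where first_url, headers, payload and max_id are placeholders for the real values:
def start_requests(self):
    # Placeholder URL, headers, payload and max_id; the real values are
    # defined elsewhere in the spider.
    for i in range(max_id):
        yield Request('first_url', method="POST", headers=headers,
                      body=payload, callback=self.parse_get_account)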
def parse_get_account(self, response):
    j = json.loads(response.body_as_unicode())
    if j['d'][0] != "":
        item = ParkerItem()
        item['account'] = j['d'][0]
        return self.parse_second_request(item)
    # leftover debug output
    print("back here" + str(item))
    print("hello")
If an account number exists, I store it in the item and call parse_second_request:
def parse_second_request(self, item):
    yield Request(method="GET", url=(url + '?' + urllib.urlencode(querystring)), headers=headers, callback=self.parse_third_request, meta={'item': item})
This calls parse_third_request (it is actually parsing only the second page):
def parse_third_request(self, response):
    item = response.meta['item']  # {'account': u'11'}
    m = response.selector.xpath('/table//td[3]/text()').extract()
    item["m"] = m[0]
    print("hi" + str(item))
    return item
This code works and the item is passed to a pipeline for storage, but it seems like there are a lot of functions for only two pages being scraped. Is there a way to simplify the code using best practices?
Upvotes: 1
Views: 392
Reputation: 473903
You can avoid having the parse_second_request() method altogether and return the Request directly:
def parse_get_account(self, response):
    j = json.loads(response.body_as_unicode())
    if j['d'][0] != "":
        item = ParkerItem()
        item['account'] = j['d'][0]
        return Request(method="GET", url=(url + '?' + urllib.urlencode(querystring)), headers=headers, callback=self.parse_third_request, meta={'item': item})
Aside from that, since your item fields are being filled from the data coming from different pages, you are doing it correctly.
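Putting it together, a rough sketch of the simplified spider could look like the following; max_id, headers, payload, url, querystring and the items import path are placeholders standing in for your real values, so swap them in before running:
import json
import urllib

import scrapy
from scrapy import Request

from myproject.items import ParkerItem  # placeholder project path

max_id = 10                      # placeholder
headers = {}                     # placeholder
payload = ""                     # placeholder
url = "http://example.com/page"  # placeholder
querystring = {"account": ""}    # placeholder


class ParkerSpider(scrapy.Spider):
    name = "parker"

    def start_requests(self):
        # Generate one POST request per id; placeholder URL/headers/payload.
        for i in range(max_id):
            yield Request('first_url', method="POST", headers=headers,
                          body=payload, callback=self.parse_get_account)

    def parse_get_account(self, response):
        j = json.loads(response.body_as_unicode())
        if j['d'][0] != "":
            item = ParkerItem()
            item['account'] = j['d'][0]
            # Issue the second request directly instead of going through an
            # intermediate method; the item travels along in meta.
            return Request(url=(url + '?' + urllib.urlencode(querystring)),
                           method="GET", headers=headers,
                           callback=self.parse_third_request,
                           meta={'item': item})

    def parse_third_request(self, response):
        # Fill the remaining field from the second page and hand the item
        # to the pipeline.
        item = response.meta['item']
        item["m"] = response.selector.xpath('/table//td[3]/text()').extract()[0]
        return item
This keeps one callback per page actually fetched, which is about as small as the spider can get while still collecting fields from two different responses.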
Upvotes: 2