user1592380

Reputation: 36277

Best practices with multi page scrapy code

I'm getting started with scrapy. My items.py contains:

class ParkerItem(scrapy.Item):
    account = scrapy.Field()
    m = scrapy.Field()

I then generate requests for the website with:

for i in range(max_id):
    yield Request('first_url', method="POST", headers=headers, body=payload, callback=self.parse_get_account)


def parse_get_account(self, response):
    j = json.loads(response.body_as_unicode())
    if j['d'][0] != "":
        item = ParkerItem()
        item['account'] = j['d'][0]
        return self.parse_second_request(item)

If an account number exists, I store it in item and call parse_second_request

def parse_second_request(self, item):

    yield Request(method="GET", url=(url + '?' + urllib.urlencode(querystring)), headers=headers, callback=self.parse_third_request,meta={'item': item})

This calls parse_third_request (it's actually parsing the second page):

def parse_third_request(self, response):

    item = response.meta['item'] # {'account': u'11'}
    m = response.selector.xpath('/table//td[3]/text()').extract()
    item["m"] = m[0]

    print("hi"+str(item))
    return item
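A side note on the query-string construction in the second request: urllib.urlencode is the Python 2 location of the function; in Python 3 it moved to urllib.parse.urlencode. A minimal illustration (the querystring values here are made up for the example):

```python
from urllib.parse import urlencode  # Python 2: from urllib import urlencode

# Example query parameters; the real spider builds these from scraped data.
querystring = {"account": "11", "page": "2"}

print(urlencode(querystring))  # account=11&page=2
```

The rest of the snippets above use Python 2 idioms (urllib.urlencode, response.body_as_unicode()), so this only matters if the spider is later ported to Python 3.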

This code works and the item is passed to a pipeline for storage, but it seems like there are a lot of functions for only two pages being scraped. Is there a way to simplify the code using best practices?

Upvotes: 1

Views: 392

Answers (1)

alecxe

Reputation: 473903

You can avoid the parse_second_request(self, item) method entirely by returning the second Request directly:

def parse_get_account(self, response):
    j = json.loads(response.body_as_unicode())
    if j['d'][0] != "":
        item = ParkerItem()
        item['account'] = j['d'][0]
        return Request(method="GET", url=(url + '?' + urllib.urlencode(querystring)), headers=headers, callback=self.parse_third_request, meta={'item': item})

Aside from that, since your item fields are filled from data coming from different pages, passing the partially built item along in meta is the correct approach.
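The hand-off pattern used here can be sketched without a running Scrapy project: the item is filled partially in one callback, attached to the next request's meta dict, and completed in the following callback. In this sketch the responses are faked with plain arguments (parse_get_account and parse_third_request mirror the spider's callbacks, but nothing here is Scrapy API):

```python
# Sketch of chaining two callbacks via a meta dict, as Scrapy does
# with Request(..., meta={'item': item}) and response.meta['item'].

def parse_get_account(account_number, m_value):
    item = {"account": account_number}   # first page fills 'account'
    meta = {"item": item}                # attach item to the "next request"
    return parse_third_request(meta, m_value)

def parse_third_request(meta, m_value):
    item = meta["item"]                  # recover the partially filled item
    item["m"] = m_value                  # second page fills 'm'
    return item

print(parse_get_account("11", "42"))  # {'account': '11', 'm': '42'}
```

In the real spider the second step receives a Response object and reads response.meta['item']; the structure of the data flow is the same.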

Upvotes: 2
