user1592380

Reputation: 36277

Best practices with multi page scrapy code

I'm getting started with scrapy. My items.py contains:

class ParkerItem(scrapy.Item):
    account = scrapy.Field()
    m = scrapy.Field()

I then generate requests for the website with:

for i in range(max_id):
    yield Request('first_url', method="POST", headers=headers, body=payload, callback=self.parse_get_account)


def parse_get_account(self, response):
    j = json.loads(response.body_as_unicode())
    if j['d'][0] != "":
        item = ParkerItem()
        item['account'] = j['d'][0]
        return self.parse_second_request(item)

If an account number exists, I store it in item and call parse_second_request

def parse_second_request(self, item):

    yield Request(method="GET", url=(url + '?' + urllib.urlencode(querystring)), headers=headers, callback=self.parse_third_request,meta={'item': item})

This calls parse_third_request (it's actually parsing the second page):

def parse_third_request(self, response):

    item = response.meta['item'] # {'account': u'11'}
    m = response.selector.xpath('/table//td[3]/text()').extract()
    item["m"] = m[0]

    print("hi"+str(item))
    return item
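A side note on the query-string construction in the second request: urllib.urlencode is the Python 2 location of the function; in Python 3 it moved to urllib.parse.urlencode. A minimal illustration (the querystring values here are made up for the example):

```python
from urllib.parse import urlencode  # Python 2: from urllib import urlencode

# Example query parameters; the real spider builds these from scraped data.
querystring = {"account": "11", "page": "2"}

print(urlencode(querystring))  # account=11&page=2
```

The rest of the snippets above use Python 2 idioms (urllib.urlencode, response.body_as_unicode()), so this only matters if the spider is later ported to Python 3.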

This code works and the item is passed to a pipeline for storage, but it seems like there are a lot of functions for only two pages being scraped. Is there a way to simplify the code using best practices?

Upvotes: 1

Views: 392

Answers (1)

alecxe

Reputation: 473903

You can avoid the parse_second_request(self, item) method entirely by returning the second Request directly:

def parse_get_account(self, response):
    j = json.loads(response.body_as_unicode())
    if j['d'][0] != "":
        item = ParkerItem()
        item['account'] = j['d'][0]
        return Request(method="GET", url=(url + '?' + urllib.urlencode(querystring)), headers=headers, callback=self.parse_third_request, meta={'item': item})

Aside from that, since your item fields are filled from data coming from different pages, passing the partially built item along in meta is the correct approach.
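The hand-off pattern used here can be sketched without a running Scrapy project: the item is filled partially in one callback, attached to the next request's meta dict, and completed in the following callback. In this sketch the responses are faked with plain arguments (parse_get_account and parse_third_request mirror the spider's callbacks, but nothing here is Scrapy API):

```python
# Sketch of chaining two callbacks via a meta dict, as Scrapy does
# with Request(..., meta={'item': item}) and response.meta['item'].

def parse_get_account(account_number, m_value):
    item = {"account": account_number}   # first page fills 'account'
    meta = {"item": item}                # attach item to the "next request"
    return parse_third_request(meta, m_value)

def parse_third_request(meta, m_value):
    item = meta["item"]                  # recover the partially filled item
    item["m"] = m_value                  # second page fills 'm'
    return item

print(parse_get_account("11", "42"))  # {'account': '11', 'm': '42'}
```

In the real spider the second step receives a Response object and reads response.meta['item']; the structure of the data flow is the same.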

Upvotes: 2
