Order of iteration over start_urls in Scrapy

I have a list of URLs in a CSV file. I load this file into a pandas DataFrame and use the Links column as the start URLs:

start_urls =  df['Links']

Each link has this format:

http://www.bbb.org/search/?type=name&input=%28408%29+998-0983&location=&tobid=&filter=business&radius=&country=USA%2CCAN&language=en&codeType=YPPA

This link corresponds to the phone number (408) 998-0983, which appears URL-encoded in the link as %28408%29+998-0983.

For each of the pages in df['Links'] I scrape some data and save it in an item. So far so good. The problem I have is that the order in which Scrapy works through the list is not the same as the order in the DataFrame, so I can't merge the data I get with Scrapy back into the file I already have, because the rows don't match up. I would also like to handle the case where a page doesn't have the data, and return a string instead. In which part of the code could I do that? This is what I'm doing right now:

def parse(self, response):
    producto = Product(BBB_link=response.xpath(
        '//*[@id="container"]/div/div[1]/div[3]/table/tbody/tr[1]/td/h4[1]/a').extract())

Upvotes: 1

Views: 297

Answers (2)

Steve

Reputation: 976

The first part of your question is answered here, which suggests overriding start_requests() to add meta data. In your case I imagine you could add the phone number as meta data, but any convenient link back to your data frame would do. The order of the scraped data won't change, but you will have enough information to relate it to the original data in a database or spreadsheet.

from scrapy import Request
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):

    def start_requests(self):
        ...
        yield Request(url1, meta={'phone_no': '(408) 998-0983'}, callback=self.parse)
        ...

    def parse(self, response):
        ...
        item['phone_no'] = response.meta['phone_no']
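
Applied to your DataFrame, a minimal sketch might look like this. The CSV filename, the 'Phone' column, and the items module path are assumptions; any column that lets you join the scraped rows back to the original file would do:

import pandas as pd
from scrapy import Request
from scrapy.spiders import CrawlSpider

from myproject.items import Product  # assumed location of your Product item


class BBBSpider(CrawlSpider):
    name = 'bbb'

    def start_requests(self):
        # 'links.csv' and the 'Phone' column are assumptions; adjust to your data.
        df = pd.read_csv('links.csv')
        for _, row in df.iterrows():
            yield Request(row['Links'],
                          meta={'phone_no': row['Phone']},
                          callback=self.parse)

    def parse(self, response):
        producto = Product()  # assumes Product declares a phone_no field
        producto['phone_no'] = response.meta['phone_no']
        # ... scrape the other fields as before ...
        yield producto

Carrying the DataFrame index in meta instead of the phone number would work just as well as a join key.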

For the case where no data is found, you could test the list returned by your XPath. If it's empty, then nothing was found.

bbb_link = response.xpath(
    '//*[@id="container"]/div/div[1]/div[3]/table/tbody/tr[1]/td/h4[1]/a').extract()
if bbb_link:
    # <parse the page as normal>
    item['status'] = 'found ok'
else:
    item['status'] = 'not found'

yield item

Upvotes: 1

user4125604

Reputation:

Scrapy works asynchronously, which is why your idea doesn't work: responses come back in a different order than the requests were sent. A working solution would be to save 'request.url' or 'response.url' together with the scraped result in a freshly generated output.csv.
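
As a sketch of that idea (the field names are illustrative), carry the URL on the item and let Scrapy's CSV feed export write the file:

def parse(self, response):
    # Save response.url next to the scraped data so each output row
    # can be matched back to the Links column of the original file.
    yield {
        'url': response.url,
        'BBB_link': response.xpath(
            '//*[@id="container"]/div/div[1]/div[3]/table/tbody/tr[1]/td/h4[1]/a'
        ).extract(),
    }

Running the spider with 'scrapy crawl <spider_name> -o output.csv' then writes one row per scraped page, in whatever order the responses arrive, and the url column gives you a join key back to df['Links'].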

For the 2nd part of your question, have you tried try/except?

producto = Product()
try:
    producto = Product(BBB_link=response.xpath(
        '//*[@id="container"]/div/div[1]/div[3]/table/tbody/tr[1]/td/h4[1]/a').extract())
except Exception:
    producto = 'n/a'

Upvotes: 1
