Order of iteration over start_urls in Scrapy

I have a list of URLs in a CSV file. I load this file into a pandas DataFrame and use the Links column as the start URLs:

start_urls =  df['Links']

Each link has this format:

http://www.bbb.org/search/?type=name&input=%28408%29+998-0983&location=&tobid=&filter=business&radius=&country=USA%2CCAN&language=en&codeType=YPPA

This link corresponds to the phone number (408) 998-0983, which appears URL-encoded in the link as %28408%29+998-0983.

For each of the pages in df['Links'] I scrape some data and save it in an item. So far so good. The problem I have is that the order in which Scrapy works through the list is not the same as the order in the DataFrame, so I can't merge the data I get with Scrapy back into the file I already have, because the rows don't match up. I would also like to handle the case where a page doesn't have the data, and return a string instead. In which part of the code could I do that? This is what I'm doing right now:

def parse(self, response):
    producto = Product(BBB_link=response.xpath(
        '//*[@id="container"]/div/div[1]/div[3]/table/tbody/tr[1]/td/h4[1]/a').extract())

Upvotes: 1

Views: 297

Answers (2)

Steve

Reputation: 976

The first part of your question is answered here, which suggests overriding start_requests() to add meta data. In your case I imagine you could add the phone number as meta data, but any convenient link back to your data frame would do. The order of the scraped data won't change, but you will have enough information to relate it to the original data in a database or spreadsheet.

from scrapy import Request
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):

    def start_requests(self):
        ...
        yield Request(url1, meta={'phone_no': '(408) 998-0983'}, callback=self.parse)
        ...

    def parse(self, response):
        ...
        item['phone_no'] = response.meta['phone_no']
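
Applied to your DataFrame, a minimal sketch might look like this. The CSV filename, the 'Phone' column, and the items module path are assumptions; any column that lets you join the scraped rows back to the original file would do:

import pandas as pd
from scrapy import Request
from scrapy.spiders import CrawlSpider

from myproject.items import Product  # assumed location of your Product item


class BBBSpider(CrawlSpider):
    name = 'bbb'

    def start_requests(self):
        # 'links.csv' and the 'Phone' column are assumptions; adjust to your data.
        df = pd.read_csv('links.csv')
        for _, row in df.iterrows():
            yield Request(row['Links'],
                          meta={'phone_no': row['Phone']},
                          callback=self.parse)

    def parse(self, response):
        producto = Product()  # assumes Product declares a phone_no field
        producto['phone_no'] = response.meta['phone_no']
        # ... scrape the other fields as before ...
        yield producto

Carrying the DataFrame index in meta instead of the phone number would work just as well as a join key.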

For the case where no data is found, you could test the list returned by your XPath. If it's empty, then nothing was found.

bbb_link = response.xpath(
    '//*[@id="container"]/div/div[1]/div[3]/table/tbody/tr[1]/td/h4[1]/a').extract()
if bbb_link:
    # <parse the page as normal>
    item['status'] = 'found ok'
else:
    item['status'] = 'not found'

yield item

Upvotes: 1

user4125604

Reputation:

Scrapy works asynchronously, which is why your idea doesn't work: responses come back in a different order than the requests were sent. A working solution would be to save 'request.url' or 'response.url' together with the scraped result in a freshly generated output.csv.
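
As a sketch of that idea (the field names are illustrative), carry the URL on the item and let Scrapy's CSV feed export write the file:

def parse(self, response):
    # Save response.url next to the scraped data so each output row
    # can be matched back to the Links column of the original file.
    yield {
        'url': response.url,
        'BBB_link': response.xpath(
            '//*[@id="container"]/div/div[1]/div[3]/table/tbody/tr[1]/td/h4[1]/a'
        ).extract(),
    }

Running the spider with 'scrapy crawl <spider_name> -o output.csv' then writes one row per scraped page, in whatever order the responses arrive, and the url column gives you a join key back to df['Links'].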

For the 2nd part of your question, have you tried try/except?

producto = Product()
try:
    producto = Product(BBB_link=response.xpath(
        '//*[@id="container"]/div/div[1]/div[3]/table/tbody/tr[1]/td/h4[1]/a').extract())
except Exception:
    producto = 'n/a'

Upvotes: 1
