traiantomescu

Reputation: 65

Spider error URL processing

I'm getting an error while processing a URL with Scrapy 1.5.0 and Python 2.7.14.

class GoodWillOutSpider(Spider):
    name = "GoodWillOutSpider"
    allowed_domains = ["thegoodwillout.com"]  # was misspelled "allowded_domains", which scrapy silently ignores
    start_urls = [GoodWillOutURL]

    def __init__(self, *args, **kwargs):
        super(GoodWillOutSpider, self).__init__(*args, **kwargs)
        logging.critical("GoodWillOut STARTED.")

    def parse(self, response):
        products = Selector(response).xpath('//div[@id="elasticsearch-results-container"]/ul[@class="product-list clearfix"]')

        for product in products:
            item = GoodWillOutItem()
            item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()[0]
            item['link'] = "www.thegoodwillout.com" + product.xpath('//@href').extract()[0]
            # item['image'] = "http:" + product.xpath("/div[@class='catalogue-product-cover']/a[@class='catalogue-product-cover-image']/img/@src").extract()[0]
            # item['size'] = '**NOT SUPPORTED YET**'
            yield item

        yield Request(GoodWillOutURL, callback=self.parse, dont_filter=True, priority=16)

This is my class GoodWillOutSpider, and this is the error I get:

[scrapy.core.scraper] ERROR: Spider error processing <GET https://www.thegoodwillout.com/footwear> (referer: None)

line 1085, in parse
    item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()[0]
IndexError: list index out of range

Also, for future reference: how can I work out the correct XPath for any site myself, without having to ask here again?

Upvotes: 0

Views: 88

Answers (2)

Druta Ruslan

Reputation: 7402

IndexError: list index out of range

You need to check whether the list has any values after extracting it:

item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()
if item['name']:
    item['name'] = item['name'][0]
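Scrapy 1.5 selectors also provide `extract_first()`, which returns `None` (or a default you supply) instead of raising `IndexError` when the XPath matched nothing. The same guard can be written as a small standalone helper; this is a sketch that mirrors that behavior with plain lists, the product names are made up:

```python
def first_or_default(values, default=None):
    """Return the first element of a possibly-empty list, or a default.

    This mirrors what scrapy's extract_first() does for selector results,
    instead of raising IndexError when the XPath matched nothing.
    """
    return values[0] if values else default

# Simulated selector results: one match, one miss.
print(first_or_default(['Air Max 97']))          # first element of a non-empty result
print(first_or_default([], default='unknown'))   # falls back to the default
```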

Upvotes: 0

stranac

Reputation: 28206

The problem

If your scraper can't access data that you can see using your browser's developer tools, it is not seeing the same data as your browser.

This can mean one of two things:

  • Your scraper is being recognized as such and served different content
  • Some of the content is generated dynamically (usually through javascript)

The generic solution

The most straightforward way of getting around both of these problems is using an actual browser.

There are many headless browsers available, and you can choose the best one for your needs.
For scrapy, scrapy-splash is probably the simplest option.
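As a rough sketch of what wiring up scrapy-splash looks like (assuming a Splash instance running locally on port 8050; check the scrapy-splash README for the complete middleware list and spider-middleware entries):

```python
# settings.py (sketch, not a complete configuration)
SPLASH_URL = 'http://localhost:8050'  # assumes a local Splash docker container
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# In the spider, render pages through Splash instead of fetching raw HTML:
from scrapy_splash import SplashRequest

def start_requests(self):
    # 'wait' gives the page's javascript time to run before the HTML is returned
    yield SplashRequest(GoodWillOutURL, self.parse, args={'wait': 2})
```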

More specialized solutions

Sometimes, you can figure out what the reason for this different behavior is, and change your code.
This will usually be the more efficient solution, but might require significantly more work on your part.

For example, if your scraper is getting redirected, it is possible that you just need to use a different user agent string, pass some additional headers, or slow down your requests.
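Those tweaks all live in Scrapy settings. A per-spider sketch using `custom_settings`; the values here are illustrative, not known fixes for this particular site:

```python
# Sketch: per-spider overrides via Scrapy's custom_settings
class GoodWillOutSpider(Spider):
    name = "GoodWillOutSpider"
    custom_settings = {
        # Present the spider as a regular browser.
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        # Extra headers some sites expect to see.
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml',
            'Accept-Language': 'en',
        },
        # Slow down to avoid rate-limiting or bans.
        'DOWNLOAD_DELAY': 2,
    }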

If the content is generated by javascript, you might be able to look at the page source (response.text or view source in a browser), and figure out what is going on.

After that, there are two possibilities:

  • Extract the data in an alternate way (like gangabass did for your previous question)
  • Replicate what the javascript is doing in your spider code (such as making additional requests, like in the current example)
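To illustrate the second bullet: if the page turns out to fill `elasticsearch-results-container` from a JSON endpoint, the spider can call that endpoint directly and parse the body with the `json` module. The payload and field names below are entirely made up for the sketch; the real URL and schema would have to be read off the browser's network tab:

```python
import json

# Hypothetical response body, as such an endpoint might return it.
payload = '''
{
  "products": [
    {"name": "Air Max 97", "href": "/air-max-97"},
    {"name": "Gazelle", "href": "/gazelle"}
  ]
}
'''

def parse_products(raw):
    """Turn the (made-up) JSON payload into the dicts the spider would yield."""
    data = json.loads(raw)
    return [
        {'name': p['name'], 'link': 'www.thegoodwillout.com' + p['href']}
        for p in data['products']
    ]

for item in parse_products(payload):
    print(item)
```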

Upvotes: 1
