traiantomescu

Reputation: 65

Spider error URL processing

I'm getting an error while processing a URL with Scrapy 1.5.0 and Python 2.7.14.

class GoodWillOutSpider(Spider):
    name = "GoodWillOutSpider"
    allowed_domains = ["thegoodwillout.com"]  # was misspelled "allowded_domains", which scrapy silently ignores
    start_urls = [GoodWillOutURL]

    def __init__(self, *args, **kwargs):
        super(GoodWillOutSpider, self).__init__(*args, **kwargs)
        logging.critical("GoodWillOut STARTED.")

    def parse(self, response):
        products = Selector(response).xpath('//div[@id="elasticsearch-results-container"]/ul[@class="product-list clearfix"]')

        for product in products:
            item = GoodWillOutItem()
            item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()[0]
            item['link'] = "www.thegoodwillout.com" + product.xpath('//@href').extract()[0]
            # item['image'] = "http:" + product.xpath("/div[@class='catalogue-product-cover']/a[@class='catalogue-product-cover-image']/img/@src").extract()[0]
            # item['size'] = '**NOT SUPPORTED YET**'
            yield item

        yield Request(GoodWillOutURL, callback=self.parse, dont_filter=True, priority=16)

This is my class GoodWillOutSpider, and this is the error I get:

[scrapy.core.scraper] ERROR: Spider error processing <GET https://www.thegoodwillout.com/footwear> (referer: None)

line 1085, in parse
    item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()[0]
IndexError: list index out of range

Also, for future reference: how can I work out the correct XPath for any site myself, without having to ask here again?

Upvotes: 0

Views: 88

Answers (2)

Druta Ruslan

Reputation: 7402

IndexError: list index out of range

You need to check whether the list has any values after extracting it:

item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()
if item['name']:
    item['name'] = item['name'][0]
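Scrapy 1.5 selectors also provide `extract_first()`, which returns `None` (or a default you supply) instead of raising `IndexError` when the XPath matched nothing. The same guard can be written as a small standalone helper; this is a sketch that mirrors that behavior with plain lists, the product names are made up:

```python
def first_or_default(values, default=None):
    """Return the first element of a possibly-empty list, or a default.

    This mirrors what scrapy's extract_first() does for selector results,
    instead of raising IndexError when the XPath matched nothing.
    """
    return values[0] if values else default

# Simulated selector results: one match, one miss.
print(first_or_default(['Air Max 97']))          # first element of a non-empty result
print(first_or_default([], default='unknown'))   # falls back to the default
```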

Upvotes: 0

stranac

Reputation: 28206

The problem

If your scraper can't access data that you can see using your browser's developer tools, it is not seeing the same data as your browser.

This can mean one of two things:

  • Your scraper is being recognized as such and served different content
  • Some of the content is generated dynamically (usually through javascript)

The generic solution

The most straightforward way of getting around both of these problems is using an actual browser.

There are many headless browsers available, and you can choose the best one for your needs.
For scrapy, scrapy-splash is probably the simplest option.
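As a rough sketch of what wiring up scrapy-splash looks like (assuming a Splash instance running locally on port 8050; check the scrapy-splash README for the complete middleware list and spider-middleware entries):

```python
# settings.py (sketch, not a complete configuration)
SPLASH_URL = 'http://localhost:8050'  # assumes a local Splash docker container
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# In the spider, render pages through Splash instead of fetching raw HTML:
from scrapy_splash import SplashRequest

def start_requests(self):
    # 'wait' gives the page's javascript time to run before the HTML is returned
    yield SplashRequest(GoodWillOutURL, self.parse, args={'wait': 2})
```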

More specialized solutions

Sometimes, you can figure out what the reason for this different behavior is, and change your code.
This will usually be the more efficient solution, but might require significantly more work on your part.

For example, if your scraper is getting redirected, it is possible that you just need to use a different user agent string, pass some additional headers, or slow down your requests.
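Those tweaks all live in Scrapy settings. A per-spider sketch using `custom_settings`; the values here are illustrative, not known fixes for this particular site:

```python
# Sketch: per-spider overrides via Scrapy's custom_settings
class GoodWillOutSpider(Spider):
    name = "GoodWillOutSpider"
    custom_settings = {
        # Present the spider as a regular browser.
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        # Extra headers some sites expect to see.
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml',
            'Accept-Language': 'en',
        },
        # Slow down to avoid rate-limiting or bans.
        'DOWNLOAD_DELAY': 2,
    }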

If the content is generated by javascript, you might be able to look at the page source (response.text or view source in a browser), and figure out what is going on.

After that, there are two possibilities:

  • Extract the data in an alternate way (like gangabass did for your previous question)
  • Replicate what the javascript is doing in your spider code (such as making additional requests, like in the current example)
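To illustrate the second bullet: if the page turns out to fill `elasticsearch-results-container` from a JSON endpoint, the spider can call that endpoint directly and parse the body with the `json` module. The payload and field names below are entirely made up for the sketch; the real URL and schema would have to be read off the browser's network tab:

```python
import json

# Hypothetical response body, as such an endpoint might return it.
payload = '''
{
  "products": [
    {"name": "Air Max 97", "href": "/air-max-97"},
    {"name": "Gazelle", "href": "/gazelle"}
  ]
}
'''

def parse_products(raw):
    """Turn the (made-up) JSON payload into the dicts the spider would yield."""
    data = json.loads(raw)
    return [
        {'name': p['name'], 'link': 'www.thegoodwillout.com' + p['href']}
        for p in data['products']
    ]

for item in parse_products(payload):
    print(item)
```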

Upvotes: 1
