Reputation: 65
I'm getting an error while processing a URL with Scrapy 1.5.0 and Python 2.7.14.
class GoodWillOutSpider(Spider):
    name = "GoodWillOutSpider"
    allowed_domains = ["thegoodwillout.com"]
    start_urls = [GoodWillOutURL]

    def __init__(self):
        logging.critical("GoodWillOut STARTED.")

    def parse(self, response):
        products = Selector(response).xpath('//div[@id="elasticsearch-results-container"]/ul[@class="product-list clearfix"]')
        for product in products:
            item = GoodWillOutItem()
            item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()[0]
            item['link'] = "www.thegoodwillout.com" + product.xpath('//@href').extract()[0]
            # item['image'] = "http:" + product.xpath("/div[@class='catalogue-product-cover']/a[@class='catalogue-product-cover-image']/img/@src").extract()[0]
            # item['size'] = '**NOT SUPPORTED YET**'
            yield item
        yield Request(GoodWillOutURL, callback=self.parse, dont_filter=True, priority=16)
This is my class GoodWillOutSpider, and this is the error I get:
[scrapy.core.scraper] ERROR: Spider error processing <GET https://www.thegoodwillout.com/footwear> (referer: None)
  line 1085, in parse
    item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()[0]
IndexError: list index out of range
Also, for the future: how can I work out the correct xpath for any site myself, without having to ask here again?
Upvotes: 0
Views: 88
Reputation: 7402
IndexError: list index out of range
You need to check first whether the list has any values after extracting:
item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()
if item['name']:
    item['name'] = item['name'][0]
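Scrapy also ships a shortcut for this exact pattern: extract_first() accepts a default and returns it instead of raising when nothing matched. A minimal stdlib-only sketch of the same guard, using plain lists to stand in for what extract() returns (the sample strings are made up):

```python
def first_or_default(values, default=None):
    """Return the first extracted value, or a default when nothing matched."""
    return values[0] if values else default

# Plain lists standing in for Selector.xpath(...).extract() results:
print(first_or_default(["Nike Air Max"]))    # prints "Nike Air Max"
print(first_or_default([], "name unknown"))  # empty result, no IndexError
```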
Upvotes: 0
Reputation: 28206
If your scraper can't access data that you can see using your browser's developer tools, it is not seeing the same data as your browser.
This can mean one of two things:
- The content is generated dynamically by javascript, so it is not present in the HTML your scraper receives.
- The server is serving different content to your scraper than to your browser (for example, because it detects and redirects scrapers).
The most straightforward way of getting around both of these problems is using an actual browser.
There are many headless browsers available, and you can choose the best one for your needs.
For scrapy, scrapy-splash is probably the simplest option.
Sometimes, you can figure out what the reason for this different behavior is, and change your code.
This will usually be the more efficient solution, but might require significantly more work on your part.
For example, if your scraper is getting redirected, it is possible that you just need to use a different user agent string, pass some additional headers, or slow down your requests.
If the content is generated by javascript, you might be able to look at the page source (response.text, or view source in a browser) and figure out what is going on.
After that, there are two possibilities:
- The data is embedded somewhere in the page source (often as JSON inside a script tag), and you can extract and parse it directly.
- The data is loaded by a separate request that you can find in the network tab of your browser's developer tools and replicate in your spider.
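For instance, when the data is embedded as JSON in a script tag, you can pull it out with stdlib tools alone; the page snippet and the window.__DATA__ variable name below are made-up assumptions for illustration:

```python
import json
import re

# Hypothetical page source: product data embedded as JSON in a script tag,
# as is common on javascript-rendered catalogue pages.
html = '<script>window.__DATA__ = {"products": [{"name": "Air Max", "href": "/p/1"}]};</script>'

# Grab the JSON object assigned to the (assumed) global and parse it:
match = re.search(r'window\.__DATA__\s*=\s*(\{.*?\});?</script>', html, re.S)
data = json.loads(match.group(1)) if match else {}
names = [p["name"] for p in data.get("products", [])]
print(names)  # prints ['Air Max']
```

Finding the right variable name is the manual part: search the page source for a value you can see in the rendered page, then work outward to the enclosing JSON.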
Upvotes: 1