Reputation: 193
I am using Scrapy to crawl a website that has a list of items on it. However, when looping over the list of items, asking for a relative XPath returns all matching items for the entire page. I was using 0.24, and upgrading to the latest release (1.0) gives the same issue. I have also tried running this inside a virtualenv to avoid conflicts with other libraries on my system, with no success.
for sel in response.xpath('//ul[@class="items"]//div[@class="item"]'):
    item = CrawledItem()
    item['id'] = sel.xpath('.//input[@name="id"]/@value').extract()
I have tried debugging using scrapy parse and noticed that the list of ids starts off matching everything and slowly decreases, so that by the last item it matches only a single id. I was expecting a single id per item; instead I'm getting a response similar to the one below.
[
    {'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},
    {'id': [1, 2, 3, 4, 5, 6, 7, 8, 9]},
    [..] // omitted
    {'id': [10]}
]
I have also tried CSS selectors with no success. My understanding was that .// is what makes the XPath relative to the current node. How can I make sure that I'm ONLY selecting relative to the current selector?
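For reference, here is a minimal sketch with made-up markup (not the real page, just the same class names as my code) showing what I expect .// to do when the HTML nests cleanly:

from scrapy.selector import Selector

# Hypothetical, well-formed markup resembling the structure I am scraping.
body = """
<div class="items">
  <div class="item"><input name="id" value="1"/></div>
  <div class="item"><input name="id" value="2"/></div>
</div>
"""

for sel in Selector(text=body).xpath('//div[@class="items"]//div[@class="item"]'):
    # './/' starts from the current <div class="item">, so each iteration
    # should yield exactly one id here: ['1'], then ['2'].
    print(sel.xpath('.//input[@name="id"]/@value').extract())
    # '//' without the leading dot would search the whole document again
    # and return every id on every iteration.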
Upvotes: 2
Views: 620
Reputation: 2594
How can I make sure that I'm ONLY selecting relative to the current selector?
Choose your selector wisely ;-)
Indeed the page behaves counter-intuitively and it seems that relative selection does not work. As far as I inspected it, you can get the productId with the following code, which uses a deeper nested selector:
from scrapy import Spider

class TestSpider(Spider):
    name = 'test_spider'
    start_urls = ['http://www.sainsburys.co.uk/shop/gb/groceries/meat-fish/ham-82654-44']

    def parse(self, response):
        # print response.body
        # Select the add-to-trolley form wrapped around each product instead
        # of the outer item container (note the trailing space in the class).
        xpath_products = '//div[@class="addToTrolleyForm "]'
        for sel in response.xpath(xpath_products):
            src = sel.xpath('.//input[@name="productId"]/@value').extract()
            print(src)
Sorry that this does not really explain the root cause of your problem, but I'd recommend inspecting response.body closely.
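As a starting point, here is a sketch (reusing the selectors from your question, which may not match the repaired markup exactly) that you can paste into scrapy shell to check whether the parsed tree nests the item divs inside each other; that would explain the shrinking id lists:

# Run `scrapy shell <your url>` first, then paste this in the shell.
# Save the raw body so you can compare it with what lxml actually parsed.
with open('page.html', 'wb') as f:
    f.write(response.body)

for sel in response.xpath('//ul[@class="items"]//div[@class="item"]'):
    # On well-formed markup no other "item" div should sit inside this one;
    # a non-zero count means the broken HTML made the items nest.
    nested = len(sel.xpath('.//div[@class="item"]'))
    ids = sel.xpath('.//input[@name="id"]/@value').extract()
    print('%d nested items, %d ids' % (nested, len(ids)))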
Upvotes: 1