spjthe

Reputation: 193

Scrapy selectors return all on page instead of relative

I am using Scrapy to crawl a website that has a list of items on it. However, when looping over the list of items, asking for a relative XPath returns all matching items for the entire page. I have been using 0.24, but upgrading to the latest version (1.0) gives the same issue.

I have tried running this in a virtualenv to avoid conflicts with other libraries on my system, with no success.

for sel in response.xpath('//ul[@class="items"]//div[@class="item"]'):
    item = CrawledItem()
    item['id'] = sel.xpath('.//input[@name="id"]/@value').extract()

I have tried debugging with scrapy parse and noticed that the list of ids starts off with every matching id and slowly shrinks, so that by the last item it only matches a single id. I was expecting a single id per item; instead I'm getting a response similar to the one below.

[
    {
        'id': [1,2,3,4,5,6,7,8,9,10]
    },
    {
        'id': [1,2,3,4,5,6,7,8,9]
    },
    [..] // omitted
    {
        'id': [10]
    }
]

I have also tried CSS selectors, with no success. My understanding was that the .// prefix makes an XPath relative to the current selector. How can I make sure that I'm ONLY selecting relative to the current selector?

Upvotes: 2

Views: 620

Answers (1)

Frank Martin

Reputation: 2594

How can I make sure that I'm ONLY selecting relative to the current selector?

Choose your selector wisely ;-)

Indeed, the page behaves counterintuitively, and it seems as if relative selection does not work. As far as I could tell from inspecting it, you can get the productId with the following code, which uses a more deeply nested selector:

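from scrapy import Spider


class TestSpider(Spider):
    name = 'test_spider'
    start_urls = ['http://www.sainsburys.co.uk/shop/gb/groceries/meat-fish/ham-82654-44']

    def parse(self, response):
        # Uncomment to inspect the raw HTML that Scrapy actually received.
        # print(response.body)

        # Start from the innermost container that wraps exactly one product.
        # The trailing space in the class value is part of the original selector.
        xpath_products = '//div[@class="addToTrolleyForm "]'

        for sel in response.xpath(xpath_products):
            # The leading "." keeps the query relative to the current form.
            src = sel.xpath('.//input[@name="productId"]/@value').extract()
            print(src)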

Sorry that this does not directly solve your problem. I'd recommend inspecting response.body closely.
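
One thing to look for when inspecting the body: a list of ids that shrinks by one per item is what you would see if the item divs end up nested inside each other in the parsed tree (for example after the parser repairs unclosed tags). That is only a guess about this page, not something confirmed here. The sketch below reproduces the pattern with the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors; the markup and the helper name ids_per_item are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: every item div sits inside the previous one,
# which is how a tree can look after unclosed tags are repaired.
NESTED = """
<ul class="items">
  <div class="item"><input name="id" value="1"/>
    <div class="item"><input name="id" value="2"/>
      <div class="item"><input name="id" value="3"/></div>
    </div>
  </div>
</ul>
"""

# Hypothetical markup: the same items as proper siblings.
FLAT = """
<ul class="items">
  <div class="item"><input name="id" value="1"/></div>
  <div class="item"><input name="id" value="2"/></div>
  <div class="item"><input name="id" value="3"/></div>
</ul>
"""


def ids_per_item(markup):
    """Return, for each item div, the ids found by a relative .// query."""
    root = ET.fromstring(markup)
    result = []
    for item in root.findall(".//div[@class='item']"):
        # ".//" is relative to item, but it still reaches every descendant,
        # including inputs that belong to item divs nested inside it.
        inputs = item.findall(".//input[@name='id']")
        result.append([inp.get("value") for inp in inputs])
    return result


nested_ids = ids_per_item(NESTED)
flat_ids = ids_per_item(FLAT)
print(nested_ids)
print(flat_ids)
```

If the real page shows the nested shape, the relative query is working as designed, and the fix is to start from a container that wraps exactly one product, as in the spider above.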

Upvotes: 1
