Mattia Surricchio

Reputation: 1608

Scrapy - Xpath works in shell but not in code

I'm trying to crawl a website (I got their authorization). My code returns what I want in scrapy shell, but I get nothing in my spider.

I also checked all the previous questions similar to this one, without success; e.g. the website doesn't use JavaScript on the home page to load the elements I need.

import scrapy


class MySpider(scrapy.Spider):
    name = 'MySpider'

    start_urls = [  # WRONG URL, SHOULD BE https://shop.app4health.it/ PROBLEM SOLVED!
        'https://www.app4health.it/',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        print('PRE RISULTATI')

        # This works in scrapy shell, not in code
        risultati = response.xpath('//*[@id="nav"]/ol/li[*]/a/@href').extract()
        # risultati = response.css('li a::attr(href)').extract()
        print(risultati)

        print('NEXT PAGE')
        # Follow each extracted link; dont_filter so requests that were
        # already seen are not ignored by the duplicate filter
        for next_page in risultati:
            yield scrapy.Request(url=next_page, callback=self.prodotti, dont_filter=True)

    def prodotti(self, response):
        self.logger.info('A REEEESPONSEEEEEE from %s just arrived!', response.url)

The website I'm trying to crawl is https://shop.app4health.it/

The XPath expression I use is this one:

response.selector.xpath('//*[@id="nav"]/ol/li[*]/a/@href').extract()

I know there are some problems with the prodotti function etc., but that's not the point. I would like to understand why the XPath selector works in scrapy shell (I get exactly the links I need), but when I run it in my spider, I always get an empty list.

If it helps: when I use CSS selectors in my spider, it works fine and finds the elements, but I would like to use XPath (I need it for the future development of my application).

Thanks for the help :)

EDIT: I tried to print the body of the first response (from start_urls) and it's correct: I get the page I want. When I use selectors in my code (even the suggested ones), they all work fine in shell, but I get nothing in my code!

EDIT 2: I have become more experienced with Scrapy and web crawling, and I realised that sometimes the HTML page you see in your browser is different from the one you get with the Scrapy request! In my experience, some websites respond with different HTML compared to what your browser renders. That's why a "correct" XPath/CSS query taken from the browser may return nothing when used in your Scrapy code. Always check that the body of your response is what you were expecting!
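One quick way to sanity-check a selector against the body you actually received is to run it on that exact HTML outside the browser. A minimal sketch using the standard library's ElementTree (the markup here is a simplified stand-in for the real menu, not the page itself; Scrapy uses parsel, which supports full XPath, while ElementTree only supports a subset):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the body the crawler downloaded
# (structure assumed from the question, not the real page).
html = """<html><body>
<nav id="nav"><ol>
  <li><a href="https://shop.app4health.it/sonno">Sonno</a></li>
  <li><a href="https://shop.app4health.it/terapia">Terapia</a></li>
</ol></nav>
</body></html>"""

root = ET.fromstring(html)
# ElementTree's limited XPath is still enough for this check
links = [a.get('href') for a in root.findall(".//*[@id='nav']/ol/li/a")]
print(links)
```

If the selector returns nothing here but works on the browser's DOM, the served HTML differs from what the browser renders.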

SOLVED: The XPath is correct. I had written the wrong start_urls!

Upvotes: 1

Views: 1211

Answers (2)

Granitosaurus

Reputation: 21406

As an alternative to Desperado's answer, you can use CSS selectors, which are much simpler but more than enough for your use case:

$ scrapy shell "https://shop.app4health.it/"
In [1]: response.css('.level0 .level-top::attr(href)').extract()
Out[1]: 
['https://shop.app4health.it/sonno',
 'https://shop.app4health.it/monitoraggio-e-diagnostica',
 'https://shop.app4health.it/terapia',
 'https://shop.app4health.it/integratori-alimentari',
 'https://shop.app4health.it/fitness',
 'https://shop.app4health.it/benessere',
 'https://shop.app4health.it/ausili',
 'https://shop.app4health.it/prodotti-in-offerta',
 'https://shop.app4health.it/kit-regalo']

The scrapy shell command is perfect for debugging issues like this.

Upvotes: 1

Desperado

Reputation: 66

    //nav[@id="mmenu"]//ul/li[contains(@class,"level0")]/a[contains(@class,"level-top")]/@href 

Use this XPath. Also consider the page's 'view-source' before writing an XPath: the served HTML can differ from the DOM your browser renders.
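The same idea can be checked outside Scrapy. ElementTree in the standard library doesn't support XPath contains(), so as a rough stand-in (simplified markup assumed, not the real page; the substring class check is done in Python instead):

```python
import xml.etree.ElementTree as ET

# Assumed miniature version of the #mmenu navigation markup
html = """<nav id="mmenu"><ul>
  <li class="level0 first"><a class="level-top" href="https://shop.app4health.it/sonno">Sonno</a></li>
  <li class="level0"><a class="level-top" href="https://shop.app4health.it/terapia">Terapia</a></li>
  <li class="other"><a href="https://shop.app4health.it/skip">Skip</a></li>
</ul></nav>"""

root = ET.fromstring(html)
# Rough equivalent of li[contains(@class,"level0")]/a[contains(@class,"level-top")]/@href
hrefs = [
    a.get('href')
    for li in root.findall('.//ul/li')
    if 'level0' in (li.get('class') or '')
    for a in li.findall('a')
    if 'level-top' in (a.get('class') or '')
]
print(hrefs)
```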

Upvotes: 1
