Reputation: 1608
I'm trying to crawl a website (I have their authorization), and my code returns what I want in scrapy shell, but I get nothing in my spider.
I also checked all the previous questions similar to this one, without success; e.g., the website doesn't use JavaScript on the home page to load the elements I need.
import scrapy

class MySpider(scrapy.Spider):
    name = 'MySpider'
    start_urls = [ # WRONG URL, SHOULD BE https://shop.app4health.it/ PROBLEM SOLVED!
        'https://www.app4health.it/',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        print('PRE RISULTATI')
        # This works in scrapy shell, but returned an empty list here (see below).
        risultati = response.selector.xpath('//*[@id="nav"]/ol/li[*]/a/@href').extract()
        # risultati = response.css('li a::attr(href)').extract()
        print(risultati)
        print('NEXT PAGE')
        # The duplicate filter ignores already-seen requests; dont_filter=True bypasses it.
        for next_page in risultati:
            yield scrapy.Request(url=next_page, callback=self.prodotti, dont_filter=True)

    def prodotti(self, response):
        self.logger.info('A REEEESPONSEEEEEE from %s just arrived!', response.url)
The website I'm trying to crawl is https://shop.app4health.it/
The XPath expression I use is this one:
response.selector.xpath('//*[@id="nav"]/ol/li[*]/a/@href').extract()
I know there are some problems with the prodotti function etc., but that's not the point. I would like to understand why the XPath selector works in scrapy shell (I get exactly the links I need), but when I run it in my spider, I always get an empty list.
If it helps: when I use CSS selectors in my spider, it works fine and finds the elements, but I would like to use XPath (I need it for the future development of my application).
Thanks for the help :)
EDIT: I tried to print the body of the first response (from start_urls) and it's correct: I get the page I want. When I use selectors in my code (even the ones that have been suggested), they all work fine in the shell, but I get nothing in my code!
EDIT 2: I have become more experienced with Scrapy and web crawling, and I have realised that sometimes the HTML page you see in your browser is different from the one you get with the Scrapy request! In my experience, some websites respond with different HTML than the one you see in your browser. That's why the "correct" XPath/CSS query taken from the browser may return nothing when used in your Scrapy code. Always check whether the body of your response is what you were expecting!
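A quick way to do that check (a minimal sketch; open_in_browser is a helper Scrapy ships in scrapy.utils.response) is to dump the body Scrapy actually received into your browser and compare it with the live page:

from scrapy.utils.response import open_in_browser

def parse(self, response):
    # Open the HTML that Scrapy received in your local browser,
    # so you can compare it with what the site shows on a normal visit.
    open_in_browser(response)
    # Or just log the start of the raw body:
    self.logger.info(response.text[:500])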
SOLVED: The XPath is correct. I wrote the wrong start_urls!
Upvotes: 1
Views: 1211
Reputation: 21406
Alternatively to Desperado's answer, you can use CSS selectors, which are much simpler but more than enough for your use case:
$ scrapy shell "https://shop.app4health.it/"
In [1]: response.css('.level0 .level-top::attr(href)').extract()
Out[1]:
['https://shop.app4health.it/sonno',
'https://shop.app4health.it/monitoraggio-e-diagnostica',
'https://shop.app4health.it/terapia',
'https://shop.app4health.it/integratori-alimentari',
'https://shop.app4health.it/fitness',
'https://shop.app4health.it/benessere',
'https://shop.app4health.it/ausili',
'https://shop.app4health.it/prodotti-in-offerta',
'https://shop.app4health.it/kit-regalo']
The scrapy shell command is perfect for debugging issues like this.
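Wired into a spider, that selector would look roughly like this (a sketch; the prodotti callback name is taken from the question):

import scrapy

class MySpider(scrapy.Spider):
    name = 'MySpider'
    start_urls = ['https://shop.app4health.it/']

    def parse(self, response):
        # Same selector as verified in scrapy shell above.
        for href in response.css('.level0 .level-top::attr(href)').extract():
            # response.follow resolves relative URLs and builds the Request.
            yield response.follow(href, callback=self.prodotti)

    def prodotti(self, response):
        self.logger.info('Category page: %s', response.url)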
Upvotes: 1
Reputation: 66
//nav[@id="mmenu"]//ul/li[contains(@class,"level0")]/a[contains(@class,"level-top")]/@href
Use this XPath instead. Also check the page's raw source ('view-source:' in the browser) before writing an XPath, since the HTML Scrapy receives can differ from the DOM the browser renders.
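To verify the expression against the markup Scrapy actually receives (rather than the browser-rendered DOM), a quick shell session like this should do; output omitted:

$ scrapy shell "https://shop.app4health.it/"
>>> # Confirm the menu exists in the raw body and the expression matches:
>>> 'mmenu' in response.text
>>> response.xpath('//nav[@id="mmenu"]//ul/li[contains(@class,"level0")]/a[contains(@class,"level-top")]/@href').extract()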
Upvotes: 1