Reputation: 109
First of all, thank you for reading this.
I have been scraping for some time to collect minor data. Now I want to pull in some additional information, but I got stuck on pagination.
I would like to get the data-href of a link, but only of the link that contains an element with a specific class.
I have been using contains() on the link itself. How do you get the data-href when the link needs to contain an element with a specific class, as in this example?
<li><a class="cursor" data-type="js" data-href="test"><i class="fa fa-chevron-right" aria-hidden="true"></i></a></li>
I have been using the following:
next_page_url = response.selector.xpath('//*[@class="text-center"]/ul/li/a[contains(@class,"cursor")]/@data-href').extract_first()
This works, but it returns the wrong data-href: the first matching link (the chevron-left one) instead of the next-page link.
Many thanks for the help
Full source code:
<div class="pagination-container margin-bottom-20"> <div class="text-center"><ul class="pagination"><li><a class="cursor" data-type="js" data-href="/used-truck/1-32/truck-ads.html"><i class="fa fa-chevron-left" aria-hidden="true"></i></a></li><li><a href="/used-truck/1-32/truck-ads.html">1</a></li><li class="active"><a>2</a></li><li><a href="/used-truck/1-32/truck-ads.html?p=3">3</a></li><li class="hidden-xs no-link"><a>...</a></li><li class="hidden-xs"><a href="/used-truck/1-32/truck-ads.html?p=12">12</a></li><li class="hidden-xs no-link"><a>...</a></li><li class="hidden-xs"><a href="/used-truck/1-32/truck-ads.html?p=22">22</a></li><li><a class="cursor" data-type="js" data-href="/used-truck/1-32/truck-ads.html?p=3"><i class="fa fa-chevron-right" aria-hidden="true"></i></a></li></ul></div> </div> </div>
Upvotes: 0
Views: 757
Reputation: 1445
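Huh... This turned out to be a simple case (:
The mistake is .extract_first(). Both the previous-page and the next-page link have class "cursor", so the first match is the previous page. Extract the last item instead to get the next page: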
next_page = response.xpath('//a[@class="cursor"]/@data-href').extract()[-1]
This will do the trick. However, I'd recommend extracting all the links from the pagination list, since Scrapy filters out duplicate requests on its own. That does a better job and leaves less room for mistakes:
pages = response.xpath('//ul[@class="pagination"]//a/@href').extract()
for url in pages:
    yield scrapy.Request(url=response.urljoin(url), callback=self.whatever)
And so on..
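If relying on the position of the link feels fragile, the link can also be matched by the icon it contains, which is what the question asked for. Below is a minimal sketch, assuming lxml is installed (Scrapy's selectors are built on it) and using a trimmed copy of the pagination markup from the question. The names PAGINATION and NEXT_XPATH are made up for this illustration.

```python
from lxml import html

# Trimmed copy of the pagination markup from the question.
PAGINATION = """
<div class="text-center"><ul class="pagination">
<li><a class="cursor" data-type="js" data-href="/used-truck/1-32/truck-ads.html"><i class="fa fa-chevron-left" aria-hidden="true"></i></a></li>
<li class="active"><a>2</a></li>
<li><a href="/used-truck/1-32/truck-ads.html?p=3">3</a></li>
<li><a class="cursor" data-type="js" data-href="/used-truck/1-32/truck-ads.html?p=3"><i class="fa fa-chevron-right" aria-hidden="true"></i></a></li>
</ul></div>
"""

# Keep only the <a class="cursor"> whose child <i> carries the
# chevron-right class, then read its data-href attribute.
NEXT_XPATH = (
    '//ul[@class="pagination"]//a[contains(@class, "cursor")]'
    '[i[contains(@class, "fa-chevron-right")]]/@data-href'
)

tree = html.fromstring(PAGINATION)
matches = tree.xpath(NEXT_XPATH)  # list of matching attribute values
next_page_url = matches[0] if matches else None
print(next_page_url)
```

In a spider the same expression can be passed to response.xpath(NEXT_XPATH).extract_first(). Unlike extract()[-1], which raises IndexError on an empty result, extract_first() returns None when nothing matches.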
Upvotes: 1
Reputation: 1445
I'd suggest first making sure that the element exists in the initial HTML: press Ctrl+U in Chrome to view the page source, then Ctrl+F to find the element.
If the element can be found there, something is wrong with your XPath selector. Otherwise the element is generated by JavaScript and you have to get the data another way.
PS. You shouldn't use the Chrome DevTools "Elements" tab to check whether the element exists, because that tab shows the DOM after JavaScript has already run. Check the page source only (Ctrl+U).
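The same check can be done in code by running the selector against the raw, un-rendered HTML. This is a rough sketch, assuming lxml is installed. The helper xpath_matches and both sample pages are hypothetical and only illustrate the two outcomes described above.

```python
from lxml import html


def xpath_matches(raw_html, xpath):
    """Return True if the XPath matches anything in the un-rendered HTML."""
    return bool(html.fromstring(raw_html).xpath(xpath))


XPATH = '//a[@class="cursor"]/@data-href'

# Hypothetical page sources: the first ships the link in its HTML,
# the second only has an empty container that JavaScript would fill later.
static_page = '<ul class="pagination"><li><a class="cursor" data-href="/p2"></a></li></ul>'
js_page = '<div class="pagination-container"></div>'

found_in_static = xpath_matches(static_page, XPATH)
found_in_js = xpath_matches(js_page, XPATH)
print(found_in_static, found_in_js)
```

Inside a Scrapy callback, response.text holds this raw HTML, so it can be passed to the helper directly.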
Upvotes: 0
Reputation: 504
Try this:
next_page_url = response.selector.xpath('//*[@class="text-center"]/ul/li/a[@class="cursor"]/@data-href').extract_first()
Upvotes: 0