Bas Mienis

Reputation: 109

Scrapy Xpath getting the correct pagination

First of all thank you if you are reading this.

I have been scraping for some time to collect small amounts of data. Now I want to pull in some additional information, but I am stuck on the pagination.

I would like to get the data-href of the link, but only of the link that contains the right-chevron icon (the <i> element with class fa-chevron-right).

I have been using [contains()], but how do you get the data-href when the link needs to contain a child element with a specific class?

<li><a class="cursor" data-type="js" data-href="test"><i class="fa fa-chevron-right" aria-hidden="true"></i></a></li>

I have been using the following:

next_page_url = response.selector.xpath('//*[@class="text-center"]/ul/li/a[contains(@class,"cursor")]/@data-href').extract_first()

This works, but it does not return the correct data-href.

Many thanks for the help

Full source code:

<div class="pagination-container margin-bottom-20"> <div class="text-center"><ul class="pagination"><li><a class="cursor" data-type="js" data-href="/used-truck/1-32/truck-ads.html"><i class="fa fa-chevron-left" aria-hidden="true"></i></a></li><li><a href="/used-truck/1-32/truck-ads.html">1</a></li><li class="active"><a>2</a></li><li><a href="/used-truck/1-32/truck-ads.html?p=3">3</a></li><li class="hidden-xs no-link"><a>...</a></li><li class="hidden-xs"><a href="/used-truck/1-32/truck-ads.html?p=12">12</a></li><li class="hidden-xs no-link"><a>...</a></li><li class="hidden-xs"><a href="/used-truck/1-32/truck-ads.html?p=22">22</a></li><li><a class="cursor" data-type="js" data-href="/used-truck/1-32/truck-ads.html?p=3"><i class="fa fa-chevron-right" aria-hidden="true"></i></a></li></ul></div> </div> </div>

Upvotes: 0

Views: 757

Answers (3)

Michael Savchenko

Reputation: 1445

Huh... Turned out to be such a simple case (:

Your mistake is using .extract_first(). Both arrow links match the selector, and the next-page link is the last of them, so you should extract the last item:

next_page = response.xpath('//a[@class="cursor"]/@data-href').extract()[-1]

This will do the trick. However, I'd recommend extracting all the links from the pagination list, since Scrapy filters out duplicate requests for you. This does a better job and leaves less room for mistakes:

pages = response.xpath('//ul[@class="pagination"]//a/@href').extract()
for url in pages:
    yield scrapy.Request(url=response.urljoin(url), callback=self.whatever)

And so on..
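To see why following every pagination link is safe, here is a small standard-library sketch of what happens to those links. It assumes a made-up page URL (PAGE_URL) and mimics by hand two things Scrapy does: resolving relative links against the page URL, as response.urljoin() does, and dropping repeated URLs, as the default duplicate filter does. Note that the arrow links in the question carry only data-href, so the list below includes both attributes for illustration.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page being parsed (page 2 of the listing).
PAGE_URL = "https://example.com/used-truck/1-32/truck-ads.html?p=2"

# href and data-href values taken from the pagination block in the question.
links = [
    "/used-truck/1-32/truck-ads.html",       # data-href of the left arrow
    "/used-truck/1-32/truck-ads.html",       # href of page 1
    "/used-truck/1-32/truck-ads.html?p=3",   # href of page 3
    "/used-truck/1-32/truck-ads.html?p=12",  # href of page 12
    "/used-truck/1-32/truck-ads.html?p=22",  # href of page 22
    "/used-truck/1-32/truck-ads.html?p=3",   # data-href of the right arrow
]


def absolute_unique(base_url, hrefs):
    """Resolve each link against base_url and drop repeats, keeping order."""
    return list(dict.fromkeys(urljoin(base_url, h) for h in hrefs))


urls = absolute_unique(PAGE_URL, links)
for url in urls:
    print(url)
```

Six links collapse to four distinct pages, so yielding a request for each link does not cause the same page to be crawled twice.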

Upvotes: 1

Michael Savchenko

Reputation: 1445

I'd suggest first making sure that your element exists in the initial HTML:

Press Ctrl+U in Chrome to view the page source, then Ctrl+F to search for the element.

If the element can be found there, something is wrong with your XPath selector. Otherwise, the element is generated by JavaScript and you have to get the data another way.

PS. You shouldn't use the "Elements" tab of Chrome DevTools to check whether an element exists, because that tab shows the page after JavaScript has already been applied. Check the page source only (Ctrl+U).

Upvotes: 0

user_1330

Reputation: 504

Try this:

next_page_url = response.selector.xpath('//*[@class="text-center"]/ul/li/a[@class="cursor"]/@data-href').extract_first()

Upvotes: 0
