Reputation: 803
I have the following source code from which I am attempting to extract my desired information:
<div id="PaginationBottom" class="pagination">
<a href="#" data-page="2" title="page 2 of 31" >2</a>
<a href="#" data-page="3" title="page 3 of 31" >3</a>
<a href="#" data-page="4" title="page 4 of 31" >4</a>
<a href="#" data-page="10" title="page 10 of 31" >10</a>
<a href="#" data-page="2" title="page 2 of 31" class="next" >next »</a>
</div>
What I want to extract is the title="page 2 of 31" information from within the final tag itself. I can get the tag with the following code:
response.xpath('//div[@id="PaginationBottom"]//a[@class="next"]').extract()
Thus, what I'd like to know is whether it is possible to extract an attribute's value from within the tag itself. Is it? I can't find information on this anywhere, but I'm brand new to XPath and don't know the best search terms. Thanks for any help!
Upvotes: 0
Views: 60
Reputation: 5031
Try a simple one like this (htmltext is the text you want to parse):
import re
# Capture whatever sits between the fixed prefix and the ">2</a>" suffix
regex1 = r'<a href="#" data-page="2"(.+?)>2</a>'
pattern1 = re.compile(regex1)
Extracted_Text = pattern1.findall(htmltext)
print(Extracted_Text)
This code extracts everything between <a href="#" data-page="2" and >2</a>.
For the markup in the question, the output would be:
[' title="page 2 of 31" ']
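The pattern above is tied to the literal page number 2, so it matches the first link rather than the "next" link. A rough sketch of a regex that keys on class="next" instead is shown here; the names SAMPLE_HTML, NEXT_TAG, TITLE_ATTR and next_title are made up for illustration, and regexes over HTML are fragile (this one assumes double-quoted attributes and no ">" inside attribute values):

```python
import re

# Sample markup copied from the question.
SAMPLE_HTML = '''<div id="PaginationBottom" class="pagination">
<a href="#" data-page="2" title="page 2 of 31" >2</a>
<a href="#" data-page="3" title="page 3 of 31" >3</a>
<a href="#" data-page="4" title="page 4 of 31" >4</a>
<a href="#" data-page="10" title="page 10 of 31" >10</a>
<a href="#" data-page="2" title="page 2 of 31" class="next" >next »</a>
</div>'''

# An opening <a ...> tag that carries class="next" somewhere among its attributes.
NEXT_TAG = re.compile(r'<a\b[^>]*\bclass="next"[^>]*>')
# The value of a double-quoted title attribute.
TITLE_ATTR = re.compile(r'\btitle="([^"]*)"')


def next_title(html):
    """Return the title of the first <a class="next"> tag, or None if absent."""
    tag = NEXT_TAG.search(html)
    if tag is None:
        return None
    title = TITLE_ATTR.search(tag.group(0))
    return title.group(1) if title else None


print(next_title(SAMPLE_HTML))
```

Splitting the work into two patterns means the attribute order inside the tag does not matter.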
Upvotes: 0
Reputation: 473863
Add /@title to the end of your xpath expression:
//div[@id="PaginationBottom"]//a[@class="next"]/@title
Demo from the scrapy shell:
>>> response.xpath('//div[@id="PaginationBottom"]//a[@class="next"]/@title').extract()
[u'page 2 of 31']
Just a follow-up. You would probably want to get the maximum number of pages, 31, out of the title attribute value page 2 of 31. Scrapy Selector's built-in re() method would be helpful here:
>>> response.xpath('//div[@id="PaginationBottom"]/a[@class="next"]/@title').re(r'page \d+ of (\d+)')
[u'31']
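For trying the same two steps outside the scrapy shell, here is a minimal sketch that uses only the standard library. It is not the Scrapy API: xml.etree.ElementTree supports only a subset of XPath and cannot select /@title directly, so the attribute is read with .get() instead. It also assumes the snippet is well-formed XML, which the markup in the question happens to be. The names SAMPLE_HTML, next_link_title and total_pages are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Sample markup copied from the question (well-formed, so an XML parser accepts it).
SAMPLE_HTML = """<div id="PaginationBottom" class="pagination">
<a href="#" data-page="2" title="page 2 of 31" >2</a>
<a href="#" data-page="3" title="page 3 of 31" >3</a>
<a href="#" data-page="4" title="page 4 of 31" >4</a>
<a href="#" data-page="10" title="page 10 of 31" >10</a>
<a href="#" data-page="2" title="page 2 of 31" class="next" >next »</a>
</div>"""


def next_link_title(markup):
    """Return the title attribute of the <a class="next"> element, or None."""
    root = ET.fromstring(markup)
    link = root.find(".//a[@class='next']")
    return link.get("title") if link is not None else None


def total_pages(title):
    """Pull the page count out of a 'page X of Y' title, or None if it does not match."""
    match = re.search(r"page \d+ of (\d+)", title or "")
    return int(match.group(1)) if match else None


title = next_link_title(SAMPLE_HTML)
pages = total_pages(title)
print(title, pages)
```

In a real spider the Scrapy expressions above are the simpler choice, since response.xpath() handles malformed HTML and returns attribute values directly.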
Upvotes: 2