Reputation: 543
I'm trying to retrieve a product name from the following markup:
<h2>
<a href="https://example.com/item/ait-themes-anchor-wordpress-theme/">
<span>AIT Themes</span>
Anchor
<span>WordPress Theme for Campsites</span></a>
<span class="version">2.0.0</span>
</h2>
I want to get the name of the product. I am currently using the following XPath:
//a[@class="link-cover"]//parent::div/h2/a/text()
But the result also includes the text inside the span tags, which I don't want:
[<Selector xpath='.//text()' data='AIT Themes'>, <Selector xpath='.//text()' data=' Solitudo '>, <Selector xpath='.//text()' data='WordPress Theme'>]
[<Selector xpath='.//text()' data='AIT Themes'>, <Selector xpath='.//text()' data=' Spa '>, <Selector xpath='.//text()' data='WordPress Theme'>]
[<Selector xpath='.//text()' data='AIT Themes'>, <Selector xpath='.//text()' data=' SportClub '>, <Selector xpath='.//text()' data='WordPress Theme'>]
[<Selector xpath='.//text()' data='AIT Themes'>, <Selector xpath='.//text()' data=' Sushi '>, <Selector xpath='.//text()' data='WordPress Theme'>]
I tried to select the element I want by index:
response.xpath('//a[@class="link-cover"]//parent::div/h2/a/text()')[1]
But this is unreliable, because the pages on this site vary in format. The only constant is that the name of the product is always inside the a tag.
I also tried the XPath not() function, but it doesn't return anything:
//a[@class="link-cover"]//parent::div/h2/a/not(span)/text()
EDIT: For reference, I'm calling xpath through scrapy as follows:
def parse_products(self, response):
    products = response.xpath('//a[@class="link-cover"]//parent::div/h2/a')
    for product in products:
        name = product.xpath('.//text()')[1].get()
        link = product.xpath(".//@href").get()
        yield {
            "product_name": name,
            "product_link": link,
            "product_developer": response.request.meta['developer'],
            "product_category": response.request.meta['category']
        }
    next_page = response.xpath(
        '//nav[@class="navigation pagination"]/div[@class="nav-links"]/a[@class="next page-numbers"]/@href').get()
    if next_page:
        yield scrapy.Request(url=next_page, callback=self.parse_products, meta={
            "developer": response.request.meta['developer'],
            "category": response.request.meta['category']
        })
Upvotes: 1
Views: 967
Reputation: 5905
Just use //h2/a/text()[normalize-space()]. The full XPath expression for your website:
//div[@class="new-post-display new-posts2"]//h2/a/text()[normalize-space()]
Output:
Anchor
Aqua
Architect
Arctica
Aspiration
BandZone
Barcelona
BeachClub
Brick
BusinessFinder+
...
EDIT: Your XPath expression works in the Scrapy shell.
Get the data:
I think the problem is in your spider code. You've posted this as a result:
[<Selector xpath='.//text()' data='AIT Themes'>,...
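In your spider, replace .//text() with ./text() and you should be OK. The first form selects every descendant text node, including those inside the span tags; the second selects only the direct text children of the a tag. Since ./text() returns fewer nodes, the [1] index in your loop will probably need adjusting as well.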
Sidenote: if you want to use an index, put it inside the XPath expression (XPath positions start at 1):
response.xpath('//a[@class="link-cover"]//parent::div/h2/a/text()[1]')
Upvotes: 1