Émerson Felinto

Reputation: 543

Get text from XPath of immediate child

I'm trying to retrieve a product name from the following markup:

<h2>
     <a href="https://example.com/item/ait-themes-anchor-wordpress-theme/">
             <span>AIT Themes</span> 
                   Anchor 
             <span>WordPress Theme for Campsites</span></a>
             <span class="version">2.0.0</span>
</h2>

I want to get the name of the product. I am currently using the following XPath:

//a[@class="link-cover"]//parent::div/h2/a/text()

But the result also includes the text inside the span tags, which I don't want:

[<Selector xpath='.//text()' data='AIT Themes'>, <Selector xpath='.//text()' data=' Solitudo '>, <Selector xpath='.//text()' data='WordPress Theme'>]
[<Selector xpath='.//text()' data='AIT Themes'>, <Selector xpath='.//text()' data=' Spa '>, <Selector xpath='.//text()' data='WordPress Theme'>]
[<Selector xpath='.//text()' data='AIT Themes'>, <Selector xpath='.//text()' data=' SportClub '>, <Selector xpath='.//text()' data='WordPress Theme'>]
[<Selector xpath='.//text()' data='AIT Themes'>, <Selector xpath='.//text()' data=' Sushi '>, <Selector xpath='.//text()' data='WordPress Theme'>]

I tried to select the element I want by index:

response.xpath('//a[@class="link-cover"]//parent::div/h2/a/text()')[1]

But this is unreliable because the pages on this site vary in format. The one constant is that the product name is always directly inside the a tag.

I also tried to use the XPath "not" function, but it doesn't return anything:

//a[@class="link-cover"]//parent::div/h2/a/not(span)/text()

EDIT: For reference, I'm calling xpath through scrapy as follows:

    def parse_products(self, response):

        products = response.xpath('//a[@class="link-cover"]//parent::div/h2/a')

        for product in products:

            name = product.xpath('.//text()')[1].get()
            link = product.xpath(".//@href").get()

            yield {
                "product_name": name,
                "product_link": link,
                "product_developer": response.request.meta['developer'],
                "product_category": response.request.meta['category']
            }

        next_page = response.xpath(
            '//nav[@class="navigation pagination"]/div[@class="nav-links"]/a[@class="next page-numbers"]/@href').get()
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse_products, meta={
                "developer": response.request.meta['developer'],
                "category": response.request.meta['category']
            })

Upvotes: 1

Views: 967

Answers (1)

E.Wiest

Reputation: 5905

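Just use //h2/a/text()[normalize-space()]. It selects only the text nodes that are direct children of a and skips the whitespace-only ones. The full XPath expression for your website: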

//div[@class="new-post-display new-posts2"]//h2/a/text()[normalize-space()]

Output:

 Anchor 
 Aqua 
 Architect 
 Arctica 
 Aspiration 
 BandZone 
 Barcelona 
 BeachClub 
 Brick 
 BusinessFinder+
 ...

EDIT: Your XPath expression works in scrapy shell.

I think the problem is in your spider code. You've posted this as a result:

[<Selector xpath='.//text()' data='AIT Themes'>,...

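In your spider, replace .//text() with ./text() and you should be OK. The expression .//text() matches every descendant text node, including the ones inside the span elements, whereas ./text() matches only the text nodes that sit directly inside a. Since ./text() no longer returns the span text, the [1] index probably has to go as well, for example product.xpath('./text()[normalize-space()]').get().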
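
To see why the two expressions behave differently, here is a minimal sketch that can be run without Scrapy installed. It uses Python's standard xml.etree.ElementTree rather than Scrapy's selector API, and the helper names own_text and all_text are made up for this illustration. The markup is the snippet from the question, compacted onto one line; the real pages may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Markup from the question, compacted (it happens to be well-formed XML).
MARKUP = (
    '<h2>'
    '<a href="https://example.com/item/ait-themes-anchor-wordpress-theme/">'
    '<span>AIT Themes</span> Anchor <span>WordPress Theme for Campsites</span>'
    '</a>'
    '<span class="version">2.0.0</span>'
    '</h2>'
)


def own_text(element):
    """Non-blank text nodes that are direct children of element.

    Roughly what ./text()[normalize-space()] selects, except that this
    helper also strips surrounding whitespace, which the XPath does not.
    """
    # ElementTree keeps direct text in element.text and in each child's .tail.
    pieces = [element.text] + [child.tail for child in element]
    return [p.strip() for p in pieces if p and p.strip()]


def all_text(element):
    """Every non-blank descendant text node, roughly what .//text() selects."""
    return [t.strip() for t in element.itertext() if t.strip()]


link = ET.fromstring(MARKUP).find("a")
print(all_text(link))  # also contains the text of both <span> elements
print(own_text(link))  # only the text sitting directly inside <a>
```

The product name is the only text node that belongs to a itself, which is why restricting the query to direct children removes the developer and description strings without relying on a position.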

Sidenote: if you want to use an index, fix your XPath accordingly:

response.xpath('//a[@class="link-cover"]//parent::div/h2/a/text()[1]')

Upvotes: 1
