Get text from XPath of immediate child

Question

I'm trying to retrieve a product name from the following markup:


     
             AIT Themes 
                   Anchor 
             WordPress Theme for Campsites
             2.0.0

I want to get the name of the product. I am currently using the following XPath:

//a[@class="link-cover"]//parent::div/h2/a/text()

But I am also getting the text that is inside the span tag, which is unwanted.

[, , ]
[, , ]
[, , ]
[, , ]

I tried to pick the element I want by index:

response.xpath('//a[@class="link-cover"]//parent::div/h2/a/text()')[1]

But this is not reliable, because the pages on this site vary in format. The only constant is that the name of the product is always inside the a tag.

I also tried the XPath not() function, but it doesn't return anything:

//a[@class="link-cover"]//parent::div/h2/a/not(span)/text()

EDIT: For reference, I'm calling xpath through scrapy as follows:

    def parse_products(self, response):

        products = response.xpath('//a[@class="link-cover"]//parent::div/h2/a')

        for product in products:

            name = product.xpath('.//text()')[1].get()
            link = product.xpath(".//@href").get()

            yield {
                "product_name": name,
                "product_link": link,
                "product_developer": response.request.meta['developer'],
                "product_category": response.request.meta['category']
            }

        next_page = response.xpath(
            '//nav[@class="navigation pagination"]/div[@class="nav-links"]/a[@class="next page-numbers"]/@href').get()
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse_products, meta={
                "developer": response.request.meta['developer'],
                "category": response.request.meta['category']
            })

E.Wiest · Accepted Answer

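Just use //h2/a/text()[normalize-space()]. It selects only the text nodes that are direct children of the a element and drops the whitespace-only ones. Full XPath expression for your website :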

//div[@class="new-post-display new-posts2"]//h2/a/text()[normalize-space()]

Output :

 Anchor 
 Aqua 
 Architect 
 Arctica 
 Aspiration 
 BandZone 
 Barcelona 
 BeachClub 
 Brick 
 BusinessFinder+
 ...

EDIT : Your XPath expression works in scrapy shell.

I think the problem is in your spider code. You've posted this as a result :

[,...

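In your spider, replace .//text() with ./text() and you should be OK. .//text() also matches the text nodes of descendants such as the span, while ./text() matches only the direct text children of the a element.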

Sidenote : if you want to use an index, fix your XPath accordingly :

response.xpath('//a[@class="link-cover"]//parent::div/h2/a/text()[1]')
