Isolating a nested sibling tag using XPATH

Question

I'm trying to retrieve "prace.avizo.cz" and "onlineprodej.cz" from the following HTML. I've tried several variations to isolate each URL, but none have been successful.

I'm trying to get it via an importXML function in a Google Docs spreadsheet. Some of the paths I've tried are:

=importXML(B2,"//article[@class='genericlist component leadingReferers']//ul/li[1]")

=importXML(B2,"//ul[@class='sites items']//li[1]")

=importXML(B2,"//li[@class='item']//div//a")

These either don't work or return extra irrelevant data. I'm only looking for the data within this specific article class (genericlist component leadingReferers).

Any help is appreciated.




    Top Publishers

        Prace.avizo.cz

        Onlineprodej.cz
....

helderdarocha · Accepted Answer

This expression will give you the last text node inside the <a> element of the first <li> item in the article:

//article[@class='genericlist component leadingReferers']//li[1]//a/text()[last()]

which is the one that contains the text Prace.avizo.cz (surrounded by spaces, tabs and newlines). If you wish to trim those extra spaces, you can pass that expression as the argument to the XPath function normalize-space():

normalize-space( //article[@class='genericlist component leadingReferers']//li[1]//a/text()[last()] )
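To see what normalize-space() does to such a text node, here is a rough Python equivalent. The sample string is only an assumption about what the raw node looks like, and normalize_space is a hypothetical helper rather than a library function; Python's str.split() also treats a slightly wider set of characters as whitespace than XPath does.

```python
def normalize_space(text: str) -> str:
    """Approximate XPath normalize-space(): trim both ends and collapse
    each internal run of whitespace into a single space."""
    return " ".join(text.split())


# Assumed shape of the raw text node: the name padded with newlines and spaces.
raw = "\n                        Prace.avizo.cz\n                    "
clean = normalize_space(raw)
print(repr(clean))
```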

You can select the second item in a similar manner (same expression, using li[2]):

//article[@class='genericlist component leadingReferers']//li[2]//a/text()[last()]

If you want to retrieve a collection containing the text node from every item (which you can manipulate outside of XPath), you can use:

//article[@class='genericlist component leadingReferers']//li//a/text()[last()]

which will return a list containing all text nodes (two, in your example). In this case, you will have to use your host language to extract them (probably in a for-each loop).
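The host-language loop could look like the sketch below, written in Python with the standard-library ElementTree module. The markup is a guess at the page structure (the original HTML is not reproduced here), and ElementTree implements only a subset of XPath without text() or last(), so the hypothetical helper last_text_node emulates the text()[last()] step in plain Python.

```python
import xml.etree.ElementTree as ET

# Assumed markup, for illustration only; the real page may nest differently.
SAMPLE = """<html><body>
<article class="genericlist component leadingReferers">
  <h3>Top Publishers</h3>
  <ul class="sites items">
    <li class="item"><div><a href="#"><img src="a.png"/>
        Prace.avizo.cz
    </a></div></li>
    <li class="item"><div><a href="#"><img src="b.png"/>
        Onlineprodej.cz
    </a></div></li>
  </ul>
</article>
</body></html>"""


def last_text_node(elem):
    """Emulate text()[last()]: the last non-empty text node directly under elem."""
    nodes = [elem.text] + [child.tail for child in elem]
    nodes = [t for t in nodes if t]
    return nodes[-1] if nodes else ""


root = ET.fromstring(SAMPLE)
# ElementTree paths must be relative, hence the leading "."
links = root.findall(
    ".//article[@class='genericlist component leadingReferers']//li//a"
)

sites = []
for a in links:
    # Collapse surrounding whitespace, roughly like normalize-space()
    sites.append(" ".join(last_text_node(a).split()))

print(sites)
```

A full XPath engine would let you pass the answer's expression unchanged; the helper is only needed because of ElementTree's limited XPath support.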
