Reputation: 119
I'm trying to retrieve "prace.avizo.cz" and "onlineprodej.cz" from the following HTML. I've tried several different variations to isolate each URL, but none have been successful.
I'm trying to get them via an importXML function in a Google Docs spreadsheet. Some of the paths I've tried are:
=importXML(B2,"//article[@class='genericlist component leadingReferers']//ul/li[1]")
=importXML(B2,"//ul[@class='sites items']//li[1]")
=importXML(B2,"//li[@class='item']//div//a")
These either don't work or return extra, irrelevant data. I'm only looking for the data within this specific article class (genericlist component leadingReferers).
Any help is appreciated.
<article class="genericlist component leadingReferers">
<h2 class="title">
Top Publishers
<i class="tooltip sprite icon_tip_idle" title="&lt;h1&gt;Leading paid referring sites&lt;/h1&gt;Leading publishers referring advertising traffic to Cz.indeed.com"></i>
</h2>
<ul class="sites items">
<li class="item ">
<div class="text" title="prace.avizo.cz" data-sitename="prace.avizo.cz">
<a class="link" href="/website/prace.avizo.cz" data-tipsygravity="w" data-shorturl="Prace.avizo.cz">
<img class="icon lazy-icon lazy" data-original="http://images2.similargroup.com/image?url=prace.avizo.cz&t=2&s=1&h=11351681863127555753" src="/images/lazy.png"/>
<noscript>
<img class="icon" src="http://images2.similargroup.com/image?url=prace.avizo.cz&t=2&s=1&h=11351681863127555753"/>
</noscript>
Prace.avizo.cz
</a>
</div>
<div class="progress-bar">
<div class="progress-value percentage per1" style="width: 62.91%"></div>
</div>
</li>
<li class="item ">
<div class="text" title="onlineprodej.cz" data-sitename="onlineprodej.cz">
<a class="link" href="/website/onlineprodej.cz" data-tipsygravity="w" data-shorturl="Onlineprodej.cz">
<img class="icon lazy-icon lazy" data-original="http://images2.similargroup.com/image?url=onlineprodej.cz&t=2&s=1&h=14252445317786093368" src="/images/lazy.png"/>
<noscript>
<img class="icon" src="http://images2.similargroup.com/image?url=onlineprodej.cz&t=2&s=1&h=14252445317786093368"/>
</noscript>
Onlineprodej.cz
</a>
</div>
<div class="progress-bar">
<div class="progress-value percentage per1" style="width: 50.88%"></div>
</div>
</li>
....
Upvotes: 1
Views: 220
Reputation: 23637
This expression will give you the last text node inside the <a> of the first item in the article:
//article[@class='genericlist component leadingReferers']//li[1]//a/text()[last()]
which is the one that contains the text Prace.avizo.cz (surrounded by spaces, tabs and newlines). If you wish to trim those extra spaces, you can pass that expression as the argument to the XPath function normalize-space():
normalize-space( //article[@class='genericlist component leadingReferers']//li[1]//a/text()[last()] )
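For reference, normalize-space() strips leading and trailing whitespace and collapses each internal run of whitespace into a single space. If the trimming has to happen outside XPath, a close approximation can be sketched in Python. The helper name normalize_space and the sample string are hypothetical, and Python's str.split() treats a few more characters as whitespace than XPath does:

```python
def normalize_space(value: str) -> str:
    # str.split() with no arguments splits on any run of whitespace and
    # discards leading/trailing whitespace, so re-joining with a single
    # space approximates XPath's normalize-space().
    return " ".join(value.split())


# Hypothetical raw text node, padded the way the answer describes.
raw = "\n\t\t   Prace.avizo.cz\n\t"
cleaned = normalize_space(raw)
print(cleaned)
```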
You can select the second item in a similar manner (same expression, using li[2]):
//article[@class='genericlist component leadingReferers']//li[2]//a/text()[last()]
If you want to retrieve a collection containing all the text nodes (which you can manipulate outside of XPath), you can use:
//article[@class='genericlist component leadingReferers']//li//a/text()[last()]
which will return a list containing the last text node of each item (two, in your example). In this case, you will have to use your host language to extract them (probably in a for-each loop).
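As a rough illustration of the host-language loop, here is a minimal Python sketch using only the standard library. It rests on several assumptions: the markup is a simplified, well-formed stand-in for the HTML in the question (images and most attributes dropped), the helper name last_text is hypothetical, and because xml.etree.ElementTree does not support text() or last(), the "last text node" is emulated by reading the tail of the last child of each <a>.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the question's markup (assumption:
# the real page would need an HTML-tolerant parser, not an XML one).
SNIPPET = """
<body>
  <article class="genericlist component leadingReferers">
    <ul class="sites items">
      <li class="item ">
        <div class="text">
          <a class="link" href="/website/prace.avizo.cz">
            <noscript></noscript>
            Prace.avizo.cz
          </a>
        </div>
      </li>
      <li class="item ">
        <div class="text">
          <a class="link" href="/website/onlineprodej.cz">
            <noscript></noscript>
            Onlineprodej.cz
          </a>
        </div>
      </li>
    </ul>
  </article>
</body>
"""


def last_text(anchor: ET.Element) -> str:
    # Emulates a/text()[last()]: the text after the last child element is
    # stored as that child's tail; with no children it is anchor.text.
    children = list(anchor)
    raw = children[-1].tail if children else anchor.text
    # Collapse surrounding whitespace, similar to normalize-space().
    return " ".join((raw or "").split())


root = ET.fromstring(SNIPPET)
# Restrict the search to the one article the question cares about.
article = root.find(".//article[@class='genericlist component leadingReferers']")

# The for-each loop: one cleaned site name per <a> inside the article.
names = [last_text(a) for a in article.iter("a")]
print(names)
```

Doing the whitespace cleanup per node in the loop sidesteps the fact that normalize-space() applied to a whole node-set only uses its first node.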
Upvotes: 1