Reputation: 6773
I am scraping a webpage using HTMLUnit and have collected a List of DOM nodes from the webpage.
Inside each of these "company" DOM nodes is some data I want to scrape. For example, I want the telephone number text from inside the <span class="tel"> element shown in the HTML sample below.
Now, this element would be a child of a div element which is in turn a child of another div element inside the company node. What would be the correct XPath line to access it? Here is my latest attempt which returned nothing.
List<DomNode> companies = (List<DomNode>) page.getByXPath("//li[@class='featured block twoblock boxshadow']");
for (int j = 0; j < companies.size(); j++) {
    DomNode company = companies.get(j);
    // retrieve telephone number
    DomNode telephone = (DomNode) company.getByXPath(
            "//li[@data-pvd-p='"+j+1+"']/div[@class='listingWrapper']/div[@class='itemInfo']/span[@class='tel']").get(0);
}
Here is a sample of the HTML:
<li class="featured block twoblock boxshadow" data-pvd-p="3" data-pvd-c="0046176330000011028" data-pvd-et="sv" data-pvd-l="true">
  <div class="listingWrapper" itemtype="http://schema.org/LocalBusiness" itemscope="">
    <a href="/Craddock-Electrical-Services-Ltd/0046176330000011028/"></a>
    <div class="itemInfo">
      <div class="tradeImage" itemprop="member" itemscope="" itemtype="http://schema.org/Organization"></div>
      <h2>
        <a itemprop="name" href="/Craddock-Electrical-Services-Ltd/0046176330000011028/"></a>
      </h2>
      <span class="tel" itemprop="telephone"></span>
      <div class="listLinks"></div>
      <div id="addressBar"></div>
    </div>
    <div class="itemInfo2"></div>
    <div class="clearLeft"></div>
    <ul class="features"></ul>
    <div class="clearLeft"></div>
    <p class="promo" itemprop="description"></p>
  </div>
</li>
UPDATE 2:
Here is the current state of my XPath code.
List<DomNode> companies = (List<DomNode>) page
        .getByXPath("//li[contains(@class, 'featured block')]");
for (int j = 0; j < companies.size(); j++) {
    String url = "";
    DomNode company = companies.get(j);
    DomElement web = null;
    // retrieve name
    DomNode name = (DomNode) company.getByXPath("//a[@itemprop='name']").get(j);
    if (companiesLogged.contains(name.getTextContent().trim()) != true) {
        companiesLogged.add(name.getTextContent().trim());
        // retrieve telephone number
        DomNode telephone = (DomNode) company.getByXPath("div[@class='listingWrapper']/div[@class='itemInfo']/span[@class='tel']").get(0);
        // retrieve website
        try{
            web = (DomElement) company.getByXPath("div[@class='listingWrapper']/div[@class='itemInfo']" +
                    "/div[@class='listLinks']/a[@id='linkWebsite']").get(0);
        } catch(IndexOutOfBoundsException e){
            System.out.print(" (No Website) ");
        }
        try{
            url = web.getAttribute("href");
        } catch (IndexOutOfBoundsException e){
            url = "N/A";
        }
        System.out.println(name.getTextContent().trim() + " "
                + telephone.getTextContent().trim()
                +" "+url.trim());
    } else {
        System.out.println("Company already logged");
    }
}
Upvotes: 1
Views: 2834
Reputation: 3219
First thing I see is how you're retrieving the group of <li> nodes. Just looking at your @class attribute, you can't really tell how many spaces are in "featured block twoblock boxshadow", but that XPath will only return a result if the attribute value is exactly equal to that string.
In that regard, try using something more flexible like contains(), i.e. //li[contains(@class, 'featured block')].
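To make the difference concrete, here is a minimal sketch using the JDK's built-in javax.xml.xpath rather than HTMLUnit, so it runs without extra dependencies. The markup is hypothetical: the first <li> has a trailing space in its class attribute, which is the kind of invisible difference that defeats an exact comparison. The class name ClassMatchDemo and the count helper are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ClassMatchDemo {
    // Hypothetical markup: the first <li> carries a trailing space in @class.
    static final String XML =
        "<ul>"
      + "<li class=\"featured block twoblock boxshadow \"/>"
      + "<li class=\"featured block twoblock boxshadow\"/>"
      + "<li class=\"plain\"/>"
      + "</ul>";

    // Returns how many nodes the given XPath expression selects in XML.
    static int count(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Exact comparison: only the <li> without the trailing space matches.
        System.out.println(count("//li[@class='featured block twoblock boxshadow']")); // 1
        // Substring comparison: both "featured block ..." items match.
        System.out.println(count("//li[contains(@class, 'featured block')]"));         // 2
    }
}
```

Note that contains() is a plain substring test, so it would also match a class such as "unfeatured blocks"; that is usually acceptable for scraping but worth keeping in mind.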
Without seeing what source you're targeting I can't suggest much more, but will update the answer when it's added to the question.
I've tried your XPath (just the /div part, since that's what was provided) on the given snippet and got back <span class="tel" itemprop="telephone"/> as a result. Looks like an issue with how you're retrieving the <li> company nodes.
Update 2:
From your updated XML snippet, your first XPath //li[@class='featured block twoblock boxshadow'] doesn't look like it will match the parent <li> node, based on what I mentioned about the spaces before. Secondly, if it did, you are checking the <li> node's attributes twice in separate queries, and assuming that the data-pvd-p value (which starts at 3 in the snippet) will always match the list index (which starts at 0, with your +1 added). I'd suggest removing the //li[@data-pvd-p='"+j+1+"'] portion and beginning the path with the div step, relative to the company node.
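There is a second reason that predicate is fragile, independent of the page: in Java, "+j+1+" inside a string expression is evaluated left to right, so j and 1 are appended as text instead of being added. A minimal sketch (the class and method names are hypothetical, for illustration only):

```java
class IndexConcatDemo {
    // Builds the predicate the way the question does: once the left operand
    // is a String, both j and 1 are appended as text.
    static String concatenated(int j) {
        return "//li[@data-pvd-p='" + j + 1 + "']";
    }

    // Parentheses force the integer addition before the string conversion.
    static String added(int j) {
        return "//li[@data-pvd-p='" + (j + 1) + "']";
    }

    public static void main(String[] args) {
        System.out.println(concatenated(2)); // //li[@data-pvd-p='21']
        System.out.println(added(2));        // //li[@data-pvd-p='3']
    }
}
```

Even with the parentheses, the query still relies on data-pvd-p lining up with the list index, which is why dropping the predicate is the safer option.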
So something like this:
List<DomNode> companies = (List<DomNode>) page.getByXPath("//li[contains(@class, 'featured block')]");
for (DomNode node : companies) {
    // retrieve telephone number, using a path relative to the current <li>
    DomNode telephone = (DomNode) node.getByXPath(
            "div[@class='listingWrapper']/div[@class='itemInfo']/span[@class='tel']").get(0);
}
Upvotes: 3
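As a footnote on why the per-company path should not start with //: in XPath, a leading // is resolved from the root of the document, not from the node you call it on. The sketch below shows this with the JDK's javax.xml.xpath on hypothetical two-company markup; it is not HTMLUnit code, but the rule comes from XPath itself, so it is worth checking whether your getByXPath calls are affected in the same way. The class name RelativeXPathDemo and the telephones helper are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class RelativeXPathDemo {
    // Hypothetical markup with two companies and two different numbers.
    static final String XML =
        "<ul>"
      + "<li><div class=\"listingWrapper\"><div class=\"itemInfo\">"
      + "<span class=\"tel\">111</span></div></div></li>"
      + "<li><div class=\"listingWrapper\"><div class=\"itemInfo\">"
      + "<span class=\"tel\">222</span></div></div></li>"
      + "</ul>";

    // Evaluates the expression once per <li>, with that <li> as context node,
    // and joins the text of the first match from each evaluation.
    static String telephones(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList companies = (NodeList) xpath.evaluate("//li", doc, XPathConstants.NODESET);
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < companies.getLength(); i++) {
                Node company = companies.item(i);
                NodeList hits = (NodeList) xpath.evaluate(
                    expression, company, XPathConstants.NODESET);
                if (i > 0) {
                    out.append(' ');
                }
                out.append(hits.item(0).getTextContent());
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Leading "//" ignores the context node: the first number comes back twice.
        System.out.println(telephones("//span[@class='tel']")); // 111 111
        // Relative path: each company yields its own number.
        System.out.println(telephones(
            "div[@class='listingWrapper']/div[@class='itemInfo']/span[@class='tel']")); // 111 222
    }
}
```

If a descendant search scoped to the current node is wanted, .//span[@class='tel'] (with the leading dot) stays relative to the context node.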