Reputation: 91
I'm using HtmlUnit on this page: http://www.my-personaltrainer.it/Foglietti-illustrativi/Torvast.html The page has an index of sections, and each section has its own text. I want to write a method that takes a section name and returns that section's text.
All the section names are inside an element with the id 'lista', and I read them like this:
HtmlPage page = webClient.getPage("http://www.my-personaltrainer.it/Foglietti-illustrativi/Torvast.html");
final String pageAsText = page.asText();
final Iterable<DomElement> div = page.getHtmlElementById("lista").getChildElements();
ArrayList<String> menu = new ArrayList<>();
for (DomElement e : div) {
    menu.add(e.asText());
}
All the content is inside a span whose children I iterate over:
Iterable<DomElement> desc = page.getHtmlElementById("foglietto_descrizioni").getChildElements();
Each section starts with an h2 tag that has no id or class, so I don't know how to extract all the text between one h2 tag and the next.
Upvotes: 2
Views: 1341
Reputation: 5549
You can use getByXPath, as in the example below:
try (WebClient webClient = new WebClient()) {
    HtmlPage page = webClient.getPage("http://www.my-personaltrainer.it/Foglietti-illustrativi/Torvast.html");
    HtmlElement span = page.getHtmlElementById("foglietto_descrizioni");
    for (Object o : span.getByXPath(".//h2")) {
        HtmlHeading2 h2 = (HtmlHeading2) o;
        System.out.println("text 1 = " + h2.getFirstChild().getNextSibling().asText());
        System.out.println("text 2 = " + h2.<HtmlElement>getFirstByXPath("./span").asText());
    }
}
Note that "." means start from the current node, "/" means search direct children, and "//" means search children and grand-children recursively (all descendants).
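To get the text between one h2 and the next, one option is to find the matching h2 and then walk its following siblings until the next h2 shows up. Here is a minimal sketch of that idea. It uses the JDK's built-in XPath and DOM classes instead of HtmlUnit so it runs without extra dependencies, and the markup in SAMPLE is a made-up, simplified stand-in for the real page, so the class and method names (SectionText, sectionText) are hypothetical. HtmlUnit's DomNode also has getNextSibling(), so the same loop should carry over, but I have not tested that against the real page.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SectionText {
    // Hypothetical markup: the real page structure is assumed, not verified.
    static final String SAMPLE =
          "<div><span id='foglietto_descrizioni'>"
        + "<h2><a id='Indicazioni'/><span>Indicazioni</span></h2>"
        + "<p>First paragraph.</p><p>Second paragraph.</p>"
        + "<h2><a id='Posologia'/><span>Posologia</span></h2>"
        + "<p>Dose text.</p>"
        + "</span></div>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the text between the h2 whose text equals title and the next h2,
    // or null when no h2 matches.
    static String sectionText(Document doc, String title) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // "//h2" selects every h2 descendant of the document.
        NodeList headings = (NodeList) xpath.evaluate("//h2", doc, XPathConstants.NODESET);
        for (int i = 0; i < headings.getLength(); i++) {
            Node h2 = headings.item(i);
            if (!title.equals(h2.getTextContent().trim())) {
                continue;
            }
            StringBuilder sb = new StringBuilder();
            // Walk the siblings after this h2 and stop at the next h2.
            for (Node n = h2.getNextSibling(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE && "h2".equals(n.getNodeName())) {
                    break;
                }
                String text = n.getTextContent().trim();
                if (!text.isEmpty()) {
                    if (sb.length() > 0) {
                        sb.append(' ');
                    }
                    sb.append(text);
                }
            }
            return sb.toString();
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sectionText(parse(SAMPLE), "Indicazioni"));
    }
}
```

The heading text is compared in Java rather than spliced into the XPath string, which avoids quoting problems if a section name contains an apostrophe.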
Upvotes: 1
Reputation: 407
If the element hierarchy follows a pattern, you can access the H2 tag like this:
$('#Indicazioni').parent()
Then, if you want all the text inside the H2, you can use:
$('#Indicazioni').parent().text()
Not sure if that answers your question.
I haven't used HtmlUnit, but from what I can see it has support for jQuery.
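For anyone who would rather stay in Java, the same "find the element by id, then go to its parent" idea can be written as an XPath expression, where ".." plays the role of jQuery's .parent(). This is only a sketch: it uses the JDK's XPath classes on a made-up snippet, it assumes (as the jQuery lines above do) that the element with id 'Indicazioni' sits directly inside the H2, and ParentText and parentTextOf are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class ParentText {
    // Hypothetical markup; the real page may nest the anchor differently.
    static final String SAMPLE =
          "<div><h2><a id='Indicazioni'/><span>Indicazioni</span></h2>"
        + "<p>Body text.</p></div>";

    // Returns the text of the parent of the element with the given id,
    // or null when no element has that id.
    static String parentTextOf(String xml, String id) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind $id through a variable resolver instead of concatenating strings.
        xpath.setXPathVariableResolver(name -> id);
        // "//*[@id=$id]" finds the element anywhere; "/.." steps up to its parent.
        Node parent = (Node) xpath.evaluate("//*[@id=$id]/..", doc, XPathConstants.NODE);
        return parent == null ? null : parent.getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parentTextOf(SAMPLE, "Indicazioni"));
    }
}
```

Note that this returns only the heading's own text, not the body of the section that follows it.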
Upvotes: 1