Fidelis

Reputation: 91

HtmlUnit - get the text between 2 tags without id

I'm using HtmlUnit on this page: http://www.my-personaltrainer.it/Foglietti-illustrativi/Torvast.html The page has an index of sections, and each section has its own text. I want to write a method that takes a section name and returns that section's text.

All section names are inside an element with the id 'lista', and I get them like this:

    HtmlPage page = webClient.getPage("http://www.my-personaltrainer.it/Foglietti-illustrativi/Torvast.html");
    final String pageAsText = page.asText();
    final Iterable<DomElement> div = page.getHtmlElementById("lista").getChildElements();
    ArrayList<String> menu = new ArrayList<>();
    for (DomElement e : div) {
        menu.add(e.asText());
    }

All the content is inside a span, whose children I iterate over:

Iterable<DomElement> desc = page.getHtmlElementById("foglietto_descrizioni").getChildElements();

Each section starts with an h2 tag that has no id or class, so I don't know how to extract all the text between one h2 tag and the next.

span "foglietto_descrizioni"

Upvotes: 2

Views: 1341

Answers (2)

Ahmed Ashour

Reputation: 5549

You can use .getByXPath, as in the example below:

    try (WebClient webClient = new WebClient()) {
        HtmlPage page = webClient.getPage("http://www.my-personaltrainer.it/Foglietti-illustrativi/Torvast.html");
        HtmlElement span = page.getHtmlElementById("foglietto_descrizioni");
        for (Object o : span.getByXPath(".//h2")) {
            HtmlHeading2 h2 = (HtmlHeading2) o;
            System.out.println("text 1 = " + h2.getFirstChild().getNextSibling().asText());
            System.out.println("text 2 = " + h2.<HtmlElement>getFirstByXPath("./span").asText());
        }
    }

Note that . means "starting from this node", / selects direct children, and // selects descendants at any depth.
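To get the text between one h2 and the next, one option is to find the heading with XPath and then walk its following siblings until another h2 appears. The sketch below shows the idea using only the JDK's own DOM and XPath classes on a small made-up snippet, so it runs without HtmlUnit. The class name SectionText, the SAMPLE markup and the helper names are hypothetical, and it assumes the section text sits in sibling nodes of the h2, which I have not checked against the real page. HtmlUnit's DomNode also exposes getNextSibling() and getNodeName(), so the same loop should carry over, with asText() in place of getTextContent().

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SectionText {

    // Hypothetical stand-in for the real page (assumption): each h2 is
    // followed by sibling elements that hold that section's text.
    static final String SAMPLE =
          "<span id='foglietto_descrizioni'>"
        + "<h2>Indicazioni</h2><p>First section.</p><p>More text.</p>"
        + "<h2>Posologia</h2><p>Second section.</p>"
        + "</span>";

    // Parses a well-formed XML/XHTML string into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the text between the h2 whose text equals `title` and the
    // next h2 sibling, or null if no such heading exists.
    static String sectionText(Document doc, String title) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList headings = (NodeList) xpath.evaluate("//h2", doc, XPathConstants.NODESET);
        for (int i = 0; i < headings.getLength(); i++) {
            Node h2 = headings.item(i);
            if (!h2.getTextContent().trim().equals(title)) {
                continue;
            }
            StringBuilder text = new StringBuilder();
            // Walk the following siblings and stop at the next h2 (or the end).
            for (Node n = h2.getNextSibling();
                 n != null && !"h2".equals(n.getNodeName());
                 n = n.getNextSibling()) {
                String part = n.getTextContent().trim();
                if (!part.isEmpty()) {
                    if (text.length() > 0) {
                        text.append(' ');
                    }
                    text.append(part);
                }
            }
            return text.toString();
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(sectionText(doc, "Indicazioni"));
        System.out.println(sectionText(doc, "Posologia"));
    }
}
```

Matching on the heading text means the names collected from 'lista' can be passed in directly, provided they are identical to the h2 text on the page. If they differ in case or whitespace, the comparison would need to be loosened.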

Upvotes: 1

Thowk

Reputation: 407

If the element hierarchy follows a pattern, you can access the h2 tag like this:

$('#Indicazioni').parent()

Then, to get all the text inside the h2, you can use:

$('#Indicazioni').parent().text()

Not sure if that answers your question.

I haven't used HtmlUnit, but from what I can see it has support for jQuery.

Upvotes: 1
