tablecloth26

Reputation: 49

Storing text into a String using jSoup

I'm learning to use HtmlUnit and jSoup together and have the basics working. However, when I try to store the text from a specific webpage in a String, I get only a single line back rather than the whole text.

I know the code I've written works, because when I print out p.text() inside the loop, I get the whole text of the page.

private static String getText() {
    try {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        String url = page1.getUrl().toString();
        Document doc = Jsoup.connect(url).get();
        Elements paragraphs = doc.select("div[class=govspeak] p");
        for (Element p : paragraphs)
            System.out.println(p.text());
    } catch (Exception e) {
        e.printStackTrace();
        Logger.getLogger(HTMLParser.class.getName()).log(Level.SEVERE, null, e);
    }
    return null;
}

When I introduce a String variable to hold the result of p.text(), the method returns only a single line rather than the whole text.

private static String getText() {
    String text = "";
    try {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        String url = page1.getUrl().toString();
        Document doc = Jsoup.connect(url).get();
        Elements paragraphs = doc.select("div[class=govspeak] p");
        for (Element p : paragraphs)
            text=p.text();
    } catch (Exception e) {
        e.printStackTrace();
        Logger.getLogger(HTMLParser.class.getName()).log(Level.SEVERE, null, e);
    }
    return text;
}

Ultimately, all I want to do is store the whole text into a string. Any help would be greatly appreciated, thanks in advance.

Upvotes: 0

Views: 165

Answers (3)

RBRi

Reputation: 2879

I think it is a strange idea to use the HtmlUnit result as the starting point for jSoup. There are various drawbacks to this approach (e.g. think about cookies, which the second request made by jSoup will not carry over). And of course HtmlUnit has already parsed the HTML, so you are doing the work twice.

I hope this code will fulfill your requirements without jSoup.

private static String getText() throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    StringBuilder text = new StringBuilder();
    try (WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        DomNodeList<DomNode> paragraphs = page1.querySelectorAll("div[class=govspeak] p");
        for (DomNode p : paragraphs) {
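            // Add a line break after each paragraph so they do not run together.
            text.append(p.asText()).append(System.lineSeparator());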
        }
    }
    return text.toString();
}

Upvotes: 0

Flika205

Reputation: 552

Document doc = Jsoup.connect(url).get();
String text = doc.text();

That's basically it. jSoup already takes care of stripping the HTML tags, so doc.text() gives you the text content of the whole page with the markup removed.

Upvotes: 1

Santosh Hegde

Reputation: 3520

    for (Element p : paragraphs)
        text+=p.text(); // Append the text.

In your code, you are overwriting the value of the variable text on every iteration of the loop. That's why the function returns only the last paragraph.
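
To make the difference between assigning and appending concrete, here is a minimal sketch that uses a plain list of strings in place of the p.text() values, so it runs without jSoup or HtmlUnit. The class and method names (ParagraphJoiner, overwrite, joinAll) are made up for illustration and are not part of either library. It uses a StringBuilder rather than += and puts a newline between paragraphs, which is one reasonable choice, not the only one.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; the names are illustrative only.
class ParagraphJoiner {

    // Mirrors the loop in the question: each assignment replaces the
    // previous value, so only the last paragraph survives.
    static String overwrite(List<String> paragraphs) {
        String text = "";
        for (String p : paragraphs) {
            text = p;
        }
        return text;
    }

    // Appends every paragraph, with a newline between consecutive ones.
    static String joinAll(List<String> paragraphs) {
        StringBuilder text = new StringBuilder();
        String separator = "";
        for (String p : paragraphs) {
            text.append(separator).append(p);
            separator = "\n"; // used from the second paragraph onwards
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Stand-ins for the values returned by p.text() in the question.
        List<String> paragraphs = Arrays.asList("First paragraph.", "Second paragraph.");
        System.out.println(overwrite(paragraphs)); // prints "Second paragraph."
        System.out.println(joinAll(paragraphs));   // prints both paragraphs, one per line
    }
}
```

In the original method, the same idea would mean declaring a StringBuilder before the loop, calling append(p.text()) inside it, and returning toString() at the end.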

Upvotes: 0
