Parsing HTML into formatted plaintext using jsoup

Question

I am working on a Maven project that parses HTML data from a website. I was able to parse it using the code below:

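import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public void parseData() {
    String url = "http://stackoverflow.com/help/on-topic";
    try {
        // Fetch the page and parse it into a DOM
        Document doc = Jsoup.connect(url).get();
        // First <div class="col-section">, or null if nothing matches
        Element essay = doc.select("div.col-section").first();
        if (essay == null) {
            return;
        }
        // text() returns the combined text with all markup removed
        String essayText = essay.text();
        jTextAreaAdem.setText(essayText);
    } catch (IOException ex) {
        Logger.getLogger(formAdem.class.getName()).log(Level.SEVERE, null, ex);
    }
}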

So far I have no problems: I can parse the HTML data. I used jsoup's select method with "div.col-section", which means I'm looking for a div element whose class is col-section. I want to show the data in a text area. The result is one huge paragraph, even though the real data on the website has more than one paragraph. How can I parse the data so that it keeps the same formatting as on the website?

Jonathan Hedley · Accepted Answer

The reason that it is not formatted is that the formatting is in the HTML, in tags such as <p> and <br>, and .text() discards those tags and returns only the combined text.

Jsoup has an example HTML to Plain Text convertor which you can adapt to your needs -- by providing the div element as the focus.
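A minimal sketch of that idea, written from scratch rather than copied from the jsoup example: it walks the nodes under a focus element with a NodeVisitor and emits line breaks for a few tags. The class name PlainTextSketch, the method toPlainText, and the choice of which tags produce breaks are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

class PlainTextSketch {
    // Hypothetical helper: parse an HTML string and format the first match of the selector.
    static String toPlainText(String html, String selector) {
        Element focus = Jsoup.parse(html).select(selector).first();
        return focus == null ? "" : toPlainText(focus);
    }

    // Walk every node under the focus element, keeping text and
    // turning a handful of tags into line breaks (an assumed, partial list).
    static String toPlainText(Element focus) {
        final StringBuilder sb = new StringBuilder();
        focus.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    sb.append(((TextNode) node).text());
                } else if (node.nodeName().equals("li")) {
                    sb.append(" * ");
                }
            }

            @Override
            public void tail(Node node, int depth) {
                String name = node.nodeName();
                if (name.equals("br") || name.equals("li")) {
                    sb.append("\n");
                } else if (name.equals("p") || name.matches("h[1-6]")) {
                    sb.append("\n\n");
                }
            }
        });
        return sb.toString().trim();
    }
}
```

In the question's code this would replace the call to essay.text(), for example jTextAreaAdem.setText(PlainTextSketch.toPlainText(essay)). Whitespace-only text between tags is not filtered here, so real pages may need extra cleanup.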

Alternatively, you could just select "div.col-section > *", and iterate through each Element, and print out that text with a newline.
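The second approach could look roughly like the sketch below. It parses an inline HTML string instead of fetching a URL so that it runs without network access; the class name ChildTextExtractor and the method childrenToText are hypothetical names, not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ChildTextExtractor {
    // Hypothetical helper: collect the text of each direct child of the
    // container, one child per line.
    static String childrenToText(String html, String containerSelector) {
        Document doc = Jsoup.parse(html);
        StringBuilder sb = new StringBuilder();
        // "container > *" matches only direct children of the container
        for (Element child : doc.select(containerSelector + " > *")) {
            String text = child.text();
            if (text.isEmpty()) {
                continue; // skip children with no text, such as empty wrappers
            }
            if (sb.length() > 0) {
                sb.append("\n");
            }
            sb.append(text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<div class=\"col-section\"><h2>Title</h2>"
                + "<p>First paragraph.</p><p>Second <b>one</b>.</p></div>";
        System.out.println(childrenToText(html, "div.col-section"));
    }
}
```

Each child's text() still flattens anything nested inside it, so a list or a nested div ends up on a single line. Note also that this selector matches the children of every div.col-section on the page, not only the first one.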
