sham

Reputation: 452

Extract the inner HTML of the body tag using jsoup

I am parsing HTML with jsoup and want to extract the inner HTML of the body tag.

So far I have tried document.body().children().outerHtml(), but it returns only the HTML elements and skips the floating text (text not wrapped in any HTML tag) inside the body:

private String getBodyTag(final Document document) {
        return document.body().children().outerHtml();
}

Input:

<!DOCTYPE html>
<html lang="de">
    <head>
        <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <link rel="stylesheet" type="text/css" href="assets/style.css">
    </head>
    <body>
       <div>questions to improve formatting and clarity.</div>
       <h3>Guided Mode</h3> 
       some sample raw/floating text
    </body>
</html>

Expected:

<div>questions to improve formatting and clarity.</div>
<h3>Guided Mode</h3> 
some sample raw/floating text

Actual:

<div>questions to improve formatting and clarity.</div>
<h3>Guided Mode</h3>

Upvotes: 3

Views: 1936

Answers (2)

Sergey Bzhezitskiy

Reputation: 255

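Call html() on the body element instead. It returns the body's inner HTML, including text that is not wrapped in any tag: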

private String getBodyTag(final Document document) {
    return document.body().html();
}

Upvotes: 5

Leonardo Meinerz Ramos

Reputation: 370

You could try returning document.body().html() instead (jsoup has no innerHtml property), so it would return everything inside the body tag, including the text outside any tag.

As far as I know, the way you are trying to accomplish it is not working because children() returns only child elements. The "raw text" is a text node rather than an element, so it is not considered a child there.
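
To make the difference concrete, here is a minimal sketch comparing the two calls, plus childNodes() for reaching the loose text directly. The class name BodyInnerHtml, the helper method names, and the shortened sample HTML are made up for illustration; it assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class BodyInnerHtml {

    // children() yields only element children, so loose text is dropped.
    static String elementsOnly(Document document) {
        return document.body().children().outerHtml();
    }

    // html() serializes every child node of <body>, text nodes included.
    static String innerHtml(Document document) {
        return document.body().html();
    }

    // Collects the non-blank text nodes that sit directly under <body>.
    static String looseText(Document document) {
        StringBuilder sb = new StringBuilder();
        for (Node node : document.body().childNodes()) {
            if (node instanceof TextNode) {
                TextNode text = (TextNode) node;
                if (!text.isBlank()) {
                    sb.append(text.text().trim());
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Shortened, hypothetical version of the input from the question.
        String html = "<html><body><div>one</div><h3>two</h3> loose text </body></html>";
        Document document = Jsoup.parse(html);

        System.out.println(elementsOnly(document)); // elements only
        System.out.println(innerHtml(document));    // elements plus loose text
        System.out.println(looseText(document));    // loose text only
    }
}
```

Note that jsoup pretty-prints by default, so the whitespace in the returned string may differ from the source document.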

Upvotes: 0
