Reputation: 452
I am parsing HTML using jsoup and want to extract the inner HTML of the body tag.
So far I have tried document.body().children().outerHtml(), but it returns only the HTML elements and skips floating text (text not wrapped in any HTML tag) inside the body.
private String getBodyTag(final Document document) {
return document.body().children().outerHtml();
}
Input:
<!DOCTYPE html>
<html lang="de">
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="stylesheet" type="text/css" href="assets/style.css">
</head>
<body>
<div>questions to improve formatting and clarity.</div>
<h3>Guided Mode</h3>
some sample raw/floating text
</body>
</html>
Expected:
<div>questions to improve formatting and clarity.</div>
<h3>Guided Mode</h3>
some sample raw/floating text
Actual:
<div>questions to improve formatting and clarity.</div>
<h3>Guided Mode</h3>
Upvotes: 3
Views: 1936
Reputation: 255
Please use this:
private String getBodyTag(final Document document) {
return document.body().html();
}
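To see the difference between the two calls side by side, here is a minimal sketch. It assumes jsoup is on the classpath; the class name BodyHtmlDemo, the helper method names, and the shortened sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BodyHtmlDemo {

    // Inner HTML of <body>: serializes every child node, text nodes included.
    static String bodyInnerHtml(String html) {
        Document document = Jsoup.parse(html);
        return document.body().html();
    }

    // The approach from the question: only child *elements* are serialized.
    static String bodyChildElementsHtml(String html) {
        Document document = Jsoup.parse(html);
        return document.body().children().outerHtml();
    }

    public static void main(String[] args) {
        // Hypothetical, shortened version of the input from the question.
        String html = "<html><body><div>a</div><h3>b</h3>floating text</body></html>";

        System.out.println(bodyInnerHtml(html));
        System.out.println("---");
        System.out.println(bodyChildElementsHtml(html));
    }
}
```

The first output should include the floating text and the second should not. Note that jsoup pretty-prints by default, so whitespace in the result may differ from the input.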
Upvotes: 5
Reputation: 370
You could try returning document.body().html();
instead, so it would return everything inside the body tag, including the text outside any tag.
As far as I know, the way you are trying to accomplish it does not work because children() returns only child elements, and the "raw text" is a text node rather than an element, so it is not considered a child.
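A small sketch of why the raw text goes missing, assuming jsoup is on the classpath: children() yields only elements, while childNodes() also yields text nodes. The class name BodyNodesDemo, the helper countNonBlankTextNodes, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class BodyNodesDemo {

    // Counts direct child text nodes that contain something other than whitespace.
    static int countNonBlankTextNodes(Element parent) {
        int count = 0;
        for (Node node : parent.childNodes()) {
            if (node instanceof TextNode && !((TextNode) node).isBlank()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical, shortened version of the input from the question.
        String html = "<html><body><div>a</div><h3>b</h3>floating text</body></html>";
        Element body = Jsoup.parse(html).body();

        // Only <div> and <h3> are child elements.
        System.out.println("child elements: " + body.children().size());
        // The floating text shows up here as a TextNode.
        System.out.println("non-blank text nodes: " + countNonBlankTextNodes(body));
    }
}
```

If you need to process the pieces individually rather than as one string, iterating over childNodes() like this is one option; for the plain inner HTML, body().html() is simpler.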
Upvotes: 0