Reputation: 55
I am fetching HTML from a web page and trying to retrieve data from it.
I have HTML like <h3><strong>title</strong></h3> that I want to replace with <h2>.
But, sometimes I find unexpected tags inside of the content, for example:
<h3><br/><strong>title</strong></h3>
How can I remove empty HTML tags like <p><br></p> and <h3><br /></h3> from a string?
Upvotes: 1
Views: 498
Reputation: 2876
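To remove empty elements, you can use the CSS selector :empty. It matches elements with no children, which includes void elements such as <br>. Do so in a loop: an element that contains only an empty element is not itself considered empty, but it becomes empty once its child is gone and is removed in the next iteration.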
To replace <h3><strong>...</strong></h3> tags with <h2><strong>...</strong></h2> and remove other tags inside the <h3> tag, use replaceWith:
Example Code
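import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;

Document doc = Jsoup.connect("url").get();
// Clean up empty elements; repeat until none are left, because removing
// an empty child can leave its parent empty.
while (!doc.select(":empty").isEmpty()) {
    doc.select(":empty").remove();
}
// Replace each h3 that has a direct <strong> child with an h2 holding only that text.
doc.select("h3 > strong").forEach(strong -> {
    strong.parent().replaceWith(
        new Element(Tag.valueOf("h2"), "").html("<strong>" + strong.text() + "</strong>"));
});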
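Since the question asks about a string rather than a fetched page, here is a minimal sketch of a narrower variant, assuming jsoup is on the classpath. Instead of :empty, it checks only paragraphs and headings and drops those without any text, using Element.hasText(). The class and method names (EmptyTagStripper, stripEmptyBlocks) are made up for illustration. Note that a paragraph holding only an image has no text either, so it would be removed as well.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class EmptyTagStripper {
    // Hypothetical helper: parses an HTML fragment and removes every
    // paragraph or heading that has no non-whitespace text inside it.
    static String stripEmptyBlocks(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        // Turn off pretty printing so no extra whitespace is added on output.
        doc.outputSettings().prettyPrint(false);
        for (Element el : doc.select("p, h1, h2, h3, h4, h5, h6")) {
            if (!el.hasText()) {
                el.remove();
            }
        }
        return doc.body().html();
    }
}
```

Calling stripEmptyBlocks("<p><br></p><h3><br /></h3><p>kept</p>") should leave only <p>kept</p>. Unlike the :empty loop, this leaves <br> and <img> elements elsewhere in the fragment alone.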
Upvotes: 1
Reputation: 2220
You could always try using jsoup's .text() method on the element to grab only the text, and then putting that text inside an h3.
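A rough sketch of that idea, assuming jsoup is available: take the text of each h3 and rebuild the heading around it. The sketch builds an h2 with a <strong> child, since that is what the question asks for; swap in "h3" to keep the original level. The names HeadingRebuilder and h3ToH2 are hypothetical. Setting the content with text() rather than html() means characters such as < and & in the title are escaped.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;

class HeadingRebuilder {
    // Hypothetical helper: replaces every h3 with <h2><strong>text</strong></h2>,
    // discarding any other markup (such as <br>) that was inside the h3.
    static String h3ToH2(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        // Turn off pretty printing so no extra whitespace is added on output.
        doc.outputSettings().prettyPrint(false);
        for (Element h3 : doc.select("h3")) {
            Element h2 = new Element(Tag.valueOf("h2"), "");
            // text() returns the combined text of the h3 and its descendants.
            h2.appendElement("strong").text(h3.text());
            h3.replaceWith(h2);
        }
        return doc.body().html();
    }
}
```

For the input <h3><br/><strong>title</strong></h3> this should produce <h2><strong>title</strong></h2>.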
Upvotes: 1