ramboeistblast

Reputation: 55

Replace all unpredictable combinations of HTML tags using Jsoup

I am fetching HTML from a webpage and trying to retrieve data from it.

I have HTML like <h3><strong>title</strong></h3> in which I want to replace the <h3> with an <h2>. Sometimes, though, I find unexpected tags inside the content, for example:
<h3><br/><strong>title</strong></h3>

How can I also remove empty HTML tags such as <p><br></p> and <h3><br /></h3> from the string?

Upvotes: 1

Views: 498

Answers (2)

Frederic Klein

Reputation: 2876

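To remove empty elements you can use the CSS selector :empty. Do so in a loop: an element that contains only an empty element is not itself considered empty, but it becomes empty once that child has been removed and is then caught by the next iteration. Keep in mind that :empty also matches void elements such as <br> and <img>.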

To replace <h3><strong>...</strong></h3> with <h2><strong>...</strong></h2> and drop any other tags inside the <h3>, use replaceWith:

Example Code

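import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.connect("url").get(); // "url" is a placeholder

// Clean up empty elements; repeat until none are left, because removing
// a child can leave its parent empty. :empty also matches void elements
// such as <img>, so they are removed as well.
while (!doc.select(":empty").isEmpty()) {
    doc.select(":empty").remove();
}

// Replace each <h3> that has a <strong> child with <h2><strong>text</strong></h2>
doc.select("h3 > strong").forEach(strong -> {
    Element h2 = doc.createElement("h2");
    // text(...) escapes the content, unlike concatenating it into html(...)
    h2.appendElement("strong").text(strong.text());
    strong.parent().replaceWith(h2);
});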
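
For trying this out without a live URL, the two steps can be combined into a small self-contained sketch that works on an HTML string. The class name HeadingCleanup, the helper cleanUp and the KEEP list of void elements to preserve are assumptions made for illustration, not part of jsoup. The loop here stops when a pass removes nothing, so elements that are kept on purpose cannot make it spin forever.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class HeadingCleanup {

    // Void elements that should survive the cleanup (assumed list, adjust as needed).
    private static final Set<String> KEEP =
            new HashSet<>(Arrays.asList("img", "hr", "input"));

    // Hypothetical helper: cleans an HTML fragment and returns the body markup.
    static String cleanUp(String html) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings().prettyPrint(false); // no added line breaks or indentation
        Element body = doc.body();

        // Remove empty elements until a full pass removes nothing.
        boolean removed;
        do {
            removed = false;
            for (Element el : body.select(":empty")) {
                // select() can return the body itself; never remove that.
                if (el != body && !KEEP.contains(el.tagName())) {
                    el.remove();
                    removed = true;
                }
            }
        } while (removed);

        // Swap every <h3> that directly contains a <strong> for an <h2>.
        for (Element strong : body.select("h3 > strong")) {
            Element h3 = strong.parent();
            if (h3 == null || h3.parent() == null) {
                continue; // this <h3> was already replaced via an earlier <strong>
            }
            Element h2 = doc.createElement("h2");
            h2.appendElement("strong").text(strong.text());
            h3.replaceWith(h2);
        }
        return body.html();
    }

    public static void main(String[] args) {
        // prints <h2><strong>title</strong></h2>
        System.out.println(cleanUp("<h3><br/><strong>title</strong></h3><p><br></p>"));
    }
}
```

Restricting the cleanup to the body also leaves <head> and its <meta> or <link> elements alone, which the document-wide selector would remove.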

Upvotes: 1

Zachary Craig

Reputation: 2220

You could always try using jsoup's .text() method on the element to grab the text only, and then putting that text inside a fresh h3 (or, for your case, an h2).
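
A rough sketch of that idea, assuming the target is a plain <h2> as in the question: it reads each heading's text, drops headings that turn out to have no text, and otherwise replaces the heading's children with the text and renames the tag. The class name TextOnlyHeadings and the helper rebuildHeadings are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TextOnlyHeadings {

    // Hypothetical helper: rewrites every <h3> as a text-only <h2>.
    static String rebuildHeadings(String html) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings().prettyPrint(false); // no added line breaks or indentation

        for (Element h3 : doc.select("h3")) {
            String text = h3.text(); // text of all descendants, whitespace-normalized
            if (text.isEmpty()) {
                h3.remove();         // e.g. <h3><br></h3> has no text at all
                continue;
            }
            h3.text(text);           // replaces all children with one text node
            h3.tagName("h2");        // renames the element in place
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        // prints <h2>title</h2>
        System.out.println(rebuildHeadings("<h3><br/><strong>title</strong></h3><h3><br></h3>"));
    }
}
```

This loses the inner <strong>; if you need it, append a strong element to the heading and set the text on that instead.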

Upvotes: 1
