Raghu

Reputation: 1161

Removing text enclosed between HTML tags using JSoup

In some cases of HTML cleaning, I would like to retain the text enclosed between the tags (which is the default behaviour of Jsoup), and in other cases I would like to remove the text as well as the HTML tags. Can someone please shed some light on how I can remove the text enclosed between the HTML tags using Jsoup?

Upvotes: 8

Views: 9801

Answers (2)

Jonathan Hedley

Reputation: 10522

The Cleaner will always drop tags and preserve text. If you need to drop elements (i.e. tags and text / nested elements), you can pre-parse the HTML, remove the elements using either remove() or empty(), then run the result through the cleaner.

For example:

String html = "Clean <div>Text dropped</div>";
Document doc = Jsoup.parse(html);
doc.select("div").remove();

// if not removed, the cleaner will drop the <div> but leave the inner text
String clean = Jsoup.clean(doc.body().html(), Whitelist.basic());

If you are using JSoup 1.14.1+ then use Safelist instead of Whitelist, as Whitelist has been deprecated and will be removed in 1.15.1.

String clean = Jsoup.clean(doc.body().html(), Safelist.basic());
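
As a rough, self-contained sketch of the difference between remove() and empty() before cleaning: the class and method names below (DropElementsExample, dropAndClean, emptyAndClean) are made up for illustration, and the code assumes jsoup 1.14.1+ on the classpath so that Safelist is available.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

class DropElementsExample {
    // Hypothetical helper: removes the matched elements entirely
    // (tags, text, and nested elements), then runs the cleaner.
    static String dropAndClean(String html, String selector) {
        Document doc = Jsoup.parse(html);
        doc.select(selector).remove();
        return Jsoup.clean(doc.body().html(), Safelist.basic());
    }

    // Hypothetical helper: keeps the matched elements but deletes their
    // children (including text); the cleaner then drops any tag that is
    // not in the safelist.
    static String emptyAndClean(String html, String selector) {
        Document doc = Jsoup.parse(html);
        doc.select(selector).empty();
        return Jsoup.clean(doc.body().html(), Safelist.basic());
    }

    public static void main(String[] args) {
        String html = "Keep <b>this</b> <div>Text dropped</div>";
        System.out.println(dropAndClean(html, "div"));
        System.out.println(emptyAndClean(html, "div"));
    }
}
```

With a tag like div that Safelist.basic() does not allow, both variants should end up without the inner text. The distinction matters more when the emptied tag is one the safelist keeps, since empty() leaves the (now empty) element in place.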

Upvotes: 12

NomanJaved

Reputation: 1380

String html = "<!DOCTYPE html><html><head><title></title></head><body><p>hello there</p></body></html>";
Document d = Jsoup.parse(html);
System.out.println(d);
System.out.println("************************************************");
d.getElementsByTag("p").remove();
System.out.println(d);

If you run into trouble when working with Elements, you can perform the removal directly on the Document object d, as above. That works correctly.
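
Since the question asks for both behaviours (sometimes keep the text, sometimes drop it), here is a small sketch that does both on the Document without the Cleaner. It uses remove() to drop elements together with their text and unwrap() to drop only the tag while keeping its children. The class name SelectiveStripExample and the helper strip are hypothetical, chosen only for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectiveStripExample {
    // Hypothetical helper: elements matching dropSelector are removed with
    // their text; elements matching keepTextSelector are unwrapped, so the
    // tag disappears but its text and children stay in the document.
    static String strip(String html, String dropSelector, String keepTextSelector) {
        Document d = Jsoup.parse(html);
        d.select(dropSelector).remove();
        d.select(keepTextSelector).unwrap();
        return d.body().html();
    }

    public static void main(String[] args) {
        String html = "<p>hello there</p><span>kept text</span>";
        System.out.println(strip(html, "p", "span"));
    }
}
```

Note that this does no sanitising of its own; if the input is untrusted, the result would still need to go through Jsoup.clean.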

Upvotes: 0
