Balázs Németh

Reputation: 6637

Java replace all non-HTML Tags in a String

I'd like to replace all the tag-looking parts in a String that are not valid HTML tags. A tag-looking part is anything enclosed in <> brackets, e.g. <[email protected]> or <hello>, whereas <br>, <div>, and so on have to be kept.

Do you have any idea how to achieve this?

Any help is appreciated!

cheers,

balázs

Upvotes: 5

Views: 2239

Answers (4)

dogbane

Reputation: 274612

You can use JSoup to clean HTML.

String cleaned = Jsoup.clean(html, Whitelist.relaxed());

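You can either use one of the predefined Whitelists or create a custom one in which you specify which HTML elements are allowed through the cleaner. Everything else is removed. (In jsoup 1.14.2 and later the class is named Safelist; Whitelist was deprecated there and removed in later releases.)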


Your specific example would be:

String html = "one two three <blabla> four <text> five <div class=\"bold\">six</div>";
String cleaned = Jsoup.clean(html, Whitelist.relaxed().addAttributes("div", "class"));
System.out.println(cleaned);

Output:

one two three  four  five 
<div class="bold">
 six
</div>

Upvotes: 9

axtavt

Reputation: 242706

If you are doing this in order to display untrusted data on a web page, simply removing invalid tags is not enough. Take a look at OWASP AntiSamy.

Upvotes: 0

The Ox

Reputation: 35

You may also want to account for closing tags in your comparison. Look for a leading forward slash (which marks an HTML end tag, as in </div>) and strip it before comparing the tag name against your list of valid tags.
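A minimal sketch of that idea using only the standard library, in case pulling in a parser is not an option. The class name TagFilter, the method stripUnknownTags, and the contents of the ALLOWED set are all assumptions made for illustration; it treats anything between < and > as tag-looking, skips an optional leading slash, and keeps the match only if the element name is in the allow-list. It is not an HTML sanitizer and does not validate attributes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagFilter {
    // Hypothetical allow-list: extend it with the HTML elements you want to keep.
    private static final Set<String> ALLOWED =
            new HashSet<>(Arrays.asList("br", "div", "p", "span", "b", "i", "a"));

    // Anything enclosed in <...> that contains no nested angle brackets.
    private static final Pattern TAG_LIKE = Pattern.compile("<([^<>]*)>");

    // Optional leading slash (end tag), then an element name that must be
    // followed by whitespace, a slash, or the end of the tag content.
    private static final Pattern NAME =
            Pattern.compile("^\\s*/?\\s*([A-Za-z][A-Za-z0-9]*)(?=[\\s/]|$)");

    public static String stripUnknownTags(String input) {
        Matcher m = TAG_LIKE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Matcher name = NAME.matcher(m.group(1));
            boolean keep = name.find()
                    && ALLOWED.contains(name.group(1).toLowerCase(Locale.ROOT));
            // Keep allowed tags verbatim; drop every other tag-looking part.
            m.appendReplacement(out, keep ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Because the slash is skipped before the name is compared, </div> is kept whenever div is allowed, while something like <hello> or an address in angle brackets is removed since its name is either not in the list or not followed by whitespace, a slash, or the closing bracket.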

Upvotes: 0

Manse

Reputation: 38147

Have a look at the java.util.Scanner class: you can set a delimiter and then check whether each token matches an HTML tag. You will have to build an array of the tag names that should be left untouched.

Upvotes: 0
