RandomQuestion

Reputation: 6988

Removing HTML tags except a few specific ones from a String in Java

My input is a plain-text string, and the requirement is to remove all HTML tags except a few specific ones, such as:

<p>
<li>
<u>

If these specific tags have attributes like class or id, I want to remove these attributes.

A few examples:

<a href = "#">Link</a>            ->   Link

<p>paragraph</p>                  ->   <p>paragraph</p>

<p class="class1">paragraph</p>   ->   <p>paragraph</p>

I have gone through "Remove HTML tags from a String", but it does not answer my question completely.

Can this be handled by a set of regexes, or should I use a library?

Upvotes: 4

Views: 9990

Answers (2)

RandomQuestion

Reputation: 6988

I tried jsoup, and it seems to handle all such cases. Here is example code:

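 import org.apache.commons.lang3.StringEscapeUtils; // assumed: Apache Commons Lang
 import org.jsoup.Jsoup;
 import org.jsoup.safety.Whitelist; // renamed Safelist in newer jsoup releases

 public String clean(String unsafe) {
     // Start from an empty whitelist, then allow only the listed tags (no attributes).
     Whitelist whitelist = Whitelist.none();
     whitelist.addTags("p", "br", "ul");

     // Strip every tag and attribute that is not on the whitelist.
     String safe = Jsoup.clean(unsafe, whitelist);
     // jsoup escapes text such as "<" to "&lt;"; turn those entities back into characters.
     return StringEscapeUtils.unescapeXml(safe);
 }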

For the input string

String unsafe = "<p class='p1'>paragraph</p>< this is not html > <a link='#'>Link</a> <![CDATA[<sender>John Smith</sender>]]>";

I get the following output, which is pretty much what I need:

<p>paragraph</p>< this is not html > Link <sender>John Smith</sender>

Upvotes: 15

beny23

Reputation: 35008

For simple HTML, this may be sufficient:

// remove any <script> elements, including their content (even across line breaks)
html = html.replaceAll("(?is)<script.*?</script>", "");
// this removes any attributes
html = html.replaceAll("(?i)<([a-zA-Z0-9_-]*)(\\s[^>]*)>", "<$1>");
// this removes any tags (not li and p); \\b keeps <pre>, <link> etc. from slipping through
html = html.replaceAll("(?i)<(?!/?(li|p)\\b)[^>]*>", "");
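The three replacements above can also be folded into a single pass, which makes the list of kept tags easier to change. The following is only a sketch under assumptions: the class name TagFilter, the method stripTags and the ALLOWED set are hypothetical names chosen for illustration, and the kept tags (p, li, u) are taken from the question. Like any regex approach, it does not understand comments, CDATA sections, or a ">" inside an attribute value, so it is suited to simple markup only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of either answer.
class TagFilter {
    // Tags to keep (assumed from the question); everything else is dropped.
    private static final Set<String> ALLOWED =
            new HashSet<>(Arrays.asList("p", "li", "u"));

    // Group 1: optional "/" of a closing tag. Group 2: the tag name.
    // Attributes and a trailing "/" are matched by [^>]* and discarded.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>");

    static String stripTags(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            // Allowed tags are rebuilt without attributes; others become empty.
            String replacement = ALLOWED.contains(name)
                    ? "<" + m.group(1) + name + ">"
                    : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<a href = \"#\">Link</a>"));          // Link
        System.out.println(stripTags("<p class=\"class1\">paragraph</p>")); // <p>paragraph</p>
    }
}
```

Because the pattern requires a letter directly after "<", text such as "< this is not html >" is left untouched rather than being treated as a tag.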

Hope that helps.

Upvotes: 4
