teair

Reputation: 187

Delete specific HTML tags in String

I want to delete HTML tags (that are defined in an array) from a string. My approach:

public String cleanHTML(String unsafe,String[] blacklist){
   String safe = "";
   for(String s:blacklist){
      safe =unsafe.replaceAll("\\<.{0,1}"+s+".*?>", "");
   }

   return safe;
}

To test my function I use the following main method:

public static void main(String a[]){
    StringParser sp = new StringParser();
    String[] blacklist = new String[]{"img","a"};

    System.out.println( sp.cleanHTML("<p class='p1'>paragraph</p><img></img>< this is not html > <A HREF='#'>Link</A><a link=''>another link</a> <![CDATA[<sender>John Doe</sender>]]>",blacklist));

}

Output:

<p class='p1'>paragraph</p><img></img>< this is not html > <A href='#'>Link</A> <![CDATA[<sender>John Doe</sender>]]>another link

As you can see, it only replaces the tags around the "another link" part. So I basically have two questions: 1.) How can I get my regex to replace every <a> tag regardless of whether it is lower or upper case? 2.) How can I get my code to delete every blacklisted tag, not only the last one in the array?

Thanks in advance.

Upvotes: 0

Views: 637

Answers (1)

Thomas

Reputation: 88747

1.)how can I get my regex to replace every < a > regardless if its lower or upper case

As already said by others, it would be best to use some HTML parser/cleaner since HTML doesn't fit regular expressions too well.

However, if you still want to use regular expressions and make some assumptions (e.g. the HTML is well-formed), you might want to use something like this expression:

(?i)</?(?:p|img|a).*?>

The expression is case-insensitive ((?i)), and the reluctant quantifier .*? makes it match as little as possible. However, this would have problems if an attribute contained a closing bracket, e.g. <a href="whatever" title=">>>"> would not be matched correctly. You could try to match pairs of quotation marks as well, but as you can see the expression gets ever more complicated. That's one reason why regular expressions don't fit HTML that well.
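To make both points concrete, here is a minimal sketch that applies the expression above to two sample strings. The class name TagRegexDemo and the sample inputs are made up for illustration; only the pattern itself comes from the answer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is the one quoted in the answer.
class TagRegexDemo {
    // (?i) = ignore case, /? = optional slash for closing tags,
    // .*? = reluctant match up to the first '>' that follows.
    static final Pattern TAGS = Pattern.compile("(?i)</?(?:p|img|a).*?>");

    static String strip(String html) {
        return TAGS.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Upper and lower case tags are both removed.
        System.out.println(strip("<P class='p1'>text</p><IMG src='x'><a href='#'>Link</A>"));
        // prints: textLink

        // A '>' inside an attribute value ends the match too early.
        System.out.println(strip("<a href=\"whatever\" title=\">>>\">Link</a>"));
        // prints: >>">Link
    }
}
```

The second call shows the limitation: the match stops at the first > inside the title attribute, so the rest of the tag is left behind. Note also that, as written, the alternative a would match the start of other tag names such as <abbr>.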

how can I get my code to delete every blacklisted tag,not only the last one in the array?

You need to operate on the intermediate result instead of on the initial parameter value:

String intermediate = unsafe;
for(String s:blacklist){
  // (?i) ignores case; /? also matches the closing tag
  intermediate = intermediate.replaceAll("(?i)</?"+s+".*?>", "");
}
String safe = intermediate; //maybe do some additional checks here

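Of course, if there's a large blacklist, every replaceAll call compiles its pattern again and creates a new String, so you might want to precompile the patterns or work on a StringBuffer (e.g. via Matcher.appendReplacement) instead.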

Another option, as I already demonstrated above, might be to add all those tags as alternation options, i.e. (?:a|img|p|br) etc., but if that list becomes too big it might also decrease performance.
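A rough sketch of that alternation approach, assuming the blacklist contains plain tag names: the class name BlacklistCleaner is hypothetical, and the \b and [^>]* parts are my own additions rather than part of the expression above. \b keeps a from also matching <abbr>, and [^>]* stands in for .*? (unlike the dot, it also spans line breaks).

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: one pass over the input with a single compiled pattern.
class BlacklistCleaner {
    static String cleanHTML(String unsafe, String[] blacklist) {
        // An empty alternation would match every tag, so bail out early.
        if (blacklist.length == 0) {
            return unsafe;
        }
        // Quote each name so regex metacharacters are treated literally.
        String alternation = Arrays.stream(blacklist)
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // Optional slash, one blacklisted name, word boundary, then
        // everything up to the next '>'.
        Pattern tags = Pattern.compile(
                "</?(?:" + alternation + ")\\b[^>]*>",
                Pattern.CASE_INSENSITIVE);
        return tags.matcher(unsafe).replaceAll("");
    }

    public static void main(String[] args) {
        String[] blacklist = {"img", "a"};
        System.out.println(cleanHTML(
                "<p class='p1'>paragraph</p><img></img><A HREF='#'>Link</A>",
                blacklist));
        // prints: <p class='p1'>paragraph</p>Link
    }
}
```

This still has the same weakness with a > inside attribute values, so it only suits input you can make assumptions about.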

Upvotes: 4
