yonutix

Reputation: 2439

HTML tag regex doesn't work

Why doesn't this code return ""? What regex should I use to replace all tags in an HTML file?

x = x.replaceAll("<.*>", "<h3><a href=\"#\">current community</a></h3>");

Thanks!

Upvotes: 1

Views: 87

Answers (3)

ajb

Reputation: 31699

I will agree with everyone else that attempting to use a regex to parse HTML is a bad idea. (And I think that's true even if all you're doing is removing the tags; things like comments and !CDATA will complicate any attempt at a simple solution.) However, I think it's useful to explain why your solution didn't produce the results you expected (because this applies to other situations where regexes are more appropriate).

By default, the * and + quantifiers are greedy, which means they will match as many characters as they can. Thus, in your example:

x = x.replaceAll("<.*>", "<h3><a href=\"#\">current community</a></h3>");

I think this is what you meant:

String x = "<h3><a href=\"#\">current community</a></h3>";
x = x.replaceAll("<.*>", "");

When the matching engine searches for your pattern, it finds < as the first character of x. Then it looks for a sequence of zero or more characters that can be anything, followed by >. But since it's a greedy quantifier, if there's a choice of more than one > it can pick, it will pick the one that makes .* match the longest possible string. In your case, that means that it will pick the > which is the last character of x. The effect is that the entire string is replaced by "".
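To see the greedy behaviour directly, here is a minimal sketch using java.util.regex. The class name GreedyDemo and the helper firstMatch are made up for illustration and are not from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class GreedyDemo {
    // Returns the text of the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String x = "<h3><a href=\"#\">current community</a></h3>";

        // Greedy .* extends to the last '>' in the string,
        // so the very first match already covers all of x.
        System.out.println(firstMatch("<.*>", x).equals(x));    // true

        // Replacing that single match leaves nothing behind.
        System.out.println(x.replaceAll("<.*>", "").isEmpty()); // true
    }
}
```

Printing firstMatch("<.*>", x) instead of comparing it shows the whole input string, which is why nothing is left after the replacement.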

To make it match the smallest possible string, add ? to make it a "reluctant quantifier":

x = x.replaceAll("<.*?>", "");

Another solution is to tell the matcher not to include > when matching "any character":

x = x.replaceAll("<[^>]*>", "");

[^>] means "match any character except >". For HTML/XML/SGML, the regex I would choose is neither of the above, since you shouldn't use regular expressions to parse complex structures like that.
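As a rough comparison of the two patterns, the sketch below runs both on the string from the question and on two inputs I made up to show where they diverge or break. TagStripDemo, stripReluctant and stripNegated are hypothetical names, not part of any library.

```java
// Hypothetical demo class; the names and the extra inputs are illustrative only.
class TagStripDemo {
    // Reluctant quantifier: stop at the first '>' that lets the match succeed.
    static String stripReluctant(String html) {
        return html.replaceAll("<.*?>", "");
    }

    // Negated character class: never step over a '>' while matching.
    static String stripNegated(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String x = "<h3><a href=\"#\">current community</a></h3>";
        System.out.println(stripReluctant(x)); // current community
        System.out.println(stripNegated(x));   // current community

        // A tag split across lines: '.' does not match a line terminator
        // by default, but [^>] does, so only the second version removes it.
        String multiline = "<a\nhref=\"#\">link</a>";
        System.out.println(stripReluctant(multiline).equals("<a\nhref=\"#\">link")); // true
        System.out.println(stripNegated(multiline).equals("link"));                  // true

        // A '>' inside an attribute value defeats both patterns.
        String tricky = "<a title=\"a > b\">text</a>";
        System.out.println(stripReluctant(tricky).equals(" b\">text")); // true
        System.out.println(stripNegated(tricky).equals(" b\">text"));   // true
    }
}
```

The last case is one concrete reason a real parser is the safer choice: neither pattern knows that the > sits inside a quoted attribute value.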

Upvotes: 3

user557597

Reputation:

Disclaimer: You shouldn't use regex to parse html.

But if you insist, try this:

Find: "<(?:(?:/?\\w+\\s*/?)|(?:\\w+\\s+(?:(?:(?:\"[\\S\\s]*?\")|(?:'[\\S\\s]*?'))|(?:[^>]*?))+\\s*/?)|\\?[\\S\\s]*?\\?|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:--[\\S\\s]*?--)|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?))))>"
Replace: ""

 <
 (?:
      (?:
           /? 
           \w+ 
           \s* 
           /? 
      )
   |  
      (?:
           \w+ 
           \s+ 
           (?:
                (?:
                     (?: " [\S\s]*? " )
                  |  (?: ' [\S\s]*? ' )
                )
             |  (?: [^>]*? )
           )+
           \s* 
           /? 
      )
   |  
      \?
      [\S\s]*? 
      \?
   |  
      (?:
           !
           (?:
                (?:
                     DOCTYPE
                     [\S\s]*? 
                )
             |  (?:
                     \[CDATA\[
                     [\S\s]*? 
                     \]\]
                )
             |  (?:
                     --
                     [\S\s]*? 
                     --
                )
             |  (?:
                     ATTLIST
                     [\S\s]*? 
                )
             |  (?:
                     ENTITY
                     [\S\s]*? 
                )
             |  (?:
                     ELEMENT
                     [\S\s]*? 
                )
           )
      )
 )
 >

Upvotes: 2

Reimeus

Reputation: 159784

I want to remove the HTML tags

You could simply use an HTML parsing library such as jsoup. Here is an example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = 
     Jsoup.parse("<html><h3><a href=\"#\">current community</a></h3></html>");
System.out.println(doc.text());

Output:

current community

Upvotes: 4
