Reputation: 2439
Why doesn't this code return ""? What regex should I use to remove all tags from an HTML file?
x = x.replaceAll("<.*>", "<h3><a href=\"#\">current community</a></h3>");
Thanks!
Upvotes: 1
Views: 87
Reputation: 31699
I will agree with everyone else that attempting to use a regex to parse HTML is a bad idea. (And I think that's true even if all you're doing is removing the tags; things like comments and !CDATA will complicate any attempt at a simple solution.) However, I think it's useful to explain why your solution didn't produce the results you expected (because this applies to other situations where regexes are more appropriate).
By default, the * and + quantifiers are greedy, which means they will match as many characters as they can. Thus, in your example:
x = x.replaceAll("<.*>", "<h3><a href=\"#\">current community</a></h3>");
I think this is what you meant:
String x = "<h3><a href=\"#\">current community</a></h3>";
x = x.replaceAll("<.*>", "");
When the matching engine searches for your pattern, it finds < as the first character of x. Then it looks for a sequence of zero or more characters that can be anything, followed by >. But since it's a greedy quantifier, if there's a choice of more than one > it can pick, it will pick the one that makes .* match the longest possible string. In your case, that means that it will pick the > which is the last character of x. The effect is that the entire string is replaced by "".
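To see the greedy behaviour concretely, here is a minimal sketch. The class name GreedyDemo and the helper stripGreedy are made up for illustration; only String.replaceAll comes from the code above.

```java
class GreedyDemo {
    // Hypothetical helper: applies the greedy pattern from the question.
    static String stripGreedy(String s) {
        return s.replaceAll("<.*>", "");
    }

    public static void main(String[] args) {
        String x = "<h3><a href=\"#\">current community</a></h3>";
        // .* first runs to the end of the input, then backs off only as far
        // as the last '>', so the single match covers all of x.
        System.out.println("[" + stripGreedy(x) + "]");                 // prints []
        // Everything between the first '<' and the last '>' is swallowed,
        // including the text between the tags.
        System.out.println("[" + stripGreedy("a <b>bold</b> z") + "]"); // prints [a  z]
    }
}
```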
To make it match the smallest possible string, add ? after the quantifier to make it a "reluctant quantifier":
x = x.replaceAll("<.*?>", "");
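One way to compare the two quantifiers is to list what each pattern actually matches. This is only a sketch; ReluctantDemo and its matches helper are hypothetical names, built on the standard java.util.regex classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantDemo {
    // Hypothetical helper: collects every substring of input that the regex matches.
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String x = "<h3><a href=\"#\">current community</a></h3>";
        System.out.println(matches("<.*>", x));        // greedy: one match, the whole string
        System.out.println(matches("<.*?>", x));       // reluctant: four matches, one per tag
        System.out.println(x.replaceAll("<.*?>", "")); // prints current community
    }
}
```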
Another solution is to tell the matcher not to include > when matching "any character":
x = x.replaceAll("<[^>]*>", "");
[^>] means "match any character except >". For HTML/XML/SGML, the regex I would choose is neither of the above, since you shouldn't use regular expressions to parse complex structures like that.
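A small sketch of the negated-class version, including one input where it goes wrong, may help show both points. NegatedClassDemo and stripTags are illustrative names, and the comment input is an assumed example, not something from the question.

```java
class NegatedClassDemo {
    // Hypothetical helper: removes anything that looks like <...> with no '>' inside.
    static String stripTags(String s) {
        return s.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Works on the string from the question.
        System.out.println(stripTags("<h3><a href=\"#\">current community</a></h3>")); // prints current community
        // A tag split across two lines is still removed, because [^>] also matches '\n'.
        System.out.println(stripTags("<a\nhref=\"#\">link</a>"));                      // prints link
        // A comment containing '>' is cut short at that '>', leaving debris behind.
        System.out.println("[" + stripTags("<!-- a > b -->text") + "]");               // prints [ b -->text]
    }
}
```

Note that . does not match line terminators by default in Java, so "<.*?>" would leave the two-line tag in place, while the negated class removes it. The last call is the kind of case that makes a simple regex unreliable for real HTML.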
Upvotes: 3
Reputation:
Disclaimer: You shouldn't use regex to parse html.
But, if you insist, try this:
Find: "<(?:(?:/?\\w+\\s*/?)|(?:\\w+\\s+(?:(?:(?:\"[\\S\\s]*?\")|(?:'[\\S\\s]*?'))|(?:[^>]*?))+\\s*/?)|\\?[\\S\\s]*?\\?|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:--[\\S\\s]*?--)|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?))))>"
Replace: ""
The same regex, expanded:
<
(?:
     (?:
          /?
          \w+
          \s*
          /?
     )
  |
     (?:
          \w+
          \s+
          (?:
               (?:
                    (?: " [\S\s]*? " )
                 |  (?: ' [\S\s]*? ' )
               )
            |  (?: [^>]*? )
          )+
          \s*
          /?
     )
  |
     \?
     [\S\s]*?
     \?
  |
     (?:
          !
          (?:
               (?:
                    DOCTYPE
                    [\S\s]*?
               )
            |  (?:
                    \[CDATA\[
                    [\S\s]*?
                    \]\]
               )
            |  (?:
                    --
                    [\S\s]*?
                    --
               )
            |  (?:
                    ATTLIST
                    [\S\s]*?
               )
            |  (?:
                    ENTITY
                    [\S\s]*?
               )
            |  (?:
                    ELEMENT
                    [\S\s]*?
               )
          )
     )
)
>
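For completeness, a rough sketch of plugging that regex into Java. The class name BigRegexDemo, the constant TAG, and the helper stripTags are illustrative; the pattern is the Find string above, only split across several literals for readability. The comment input is an assumed example.

```java
import java.util.regex.Pattern;

class BigRegexDemo {
    // The "Find" pattern from above, concatenated from smaller pieces.
    static final Pattern TAG = Pattern.compile(
        "<(?:"
        + "(?:/?\\w+\\s*/?)"                                    // simple open/close tag
        + "|(?:\\w+\\s+(?:(?:(?:\"[\\S\\s]*?\")|(?:'[\\S\\s]*?'))|(?:[^>]*?))+\\s*/?)" // tag with attributes
        + "|\\?[\\S\\s]*?\\?"                                   // <? ... ?>
        + "|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:--[\\S\\s]*?--)"
        + "|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?)))"
        + ")>");

    // Hypothetical helper: replaces every match of TAG with the empty string.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<h3><a href=\"#\">current community</a></h3>")); // prints current community
        // The comment branch runs to the closing "-->", so a '>' inside the comment is not a problem here.
        System.out.println(stripTags("<!-- a > b -->text"));                           // prints text
    }
}
```

Compiling the pattern once and reusing it avoids recompiling on every call, which String.replaceAll would do. This still does not make the approach a parser; the disclaimer above applies.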
Upvotes: 2