Reputation: 10437
I'm doing this on Android and I don't want to use any HTML parser libraries, since the sole purpose is to find out which HTML tags are present other than <b>, <i> and <p>.
I'm using:
Pattern p = Pattern.compile("<^bip/>");
This works well, returning all tags other than b, i and p, but it also leaves out the <img> tag. Can someone modify it so that it does not ignore the img tag?
Upvotes: 1
Views: 530
Reputation: 7132
If you want to find which tags are in your document, I would advise doing it in more than one step:
Done that way, it can be encapsulated in a class, is more configurable if you ever want to filter other tags, and is easier to understand and maintain in the long term than a cryptic regex.
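The individual steps are not preserved in this answer, so the following is only a sketch of one possible reading: first collect every tag name with a deliberately simple regex, then filter the names against a configurable set. The class name TagFinder, its constructor and findOtherTags are hypothetical names chosen for illustration, not anything from the original post.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and API are illustrative only.
class TagFinder {
    // Step 1: capture the name of every opening or closing tag.
    private static final Pattern TAG_NAME =
            Pattern.compile("</?([a-zA-Z][a-zA-Z0-9]*)");

    // Tag names (lower case) that should not be reported.
    private final Set<String> ignored = new HashSet<>();

    TagFinder(String... ignoredTags) {
        for (String tag : ignoredTags) {
            ignored.add(tag.toLowerCase(Locale.ROOT));
        }
    }

    // Returns the distinct tag names found in html, in order of first
    // appearance, excluding the ignored ones.
    Set<String> findOtherTags(String html) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = TAG_NAME.matcher(html);
        while (m.find()) {
            String name = m.group(1).toLowerCase(Locale.ROOT);
            // Step 2: keep only names outside the ignored set.
            if (!ignored.contains(name)) {
                found.add(name);
            }
        }
        return found;
    }
}
```

With new TagFinder("b", "i", "p"), an input containing p, b, IMG and a tags should yield the set [img, a]. Because whole tag names are compared, img is not confused with i, and changing the filter only means passing different names to the constructor.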
My 2c
Upvotes: 1
Reputation: 784998
I think your regex definition should be like this:
Pattern p = Pattern.compile("(?i)<(?![bip]\\b).*?/?>");
(?! ... ) is a negative lookahead, i.e. < not followed by (b or i or p) + a word boundary
(?i) is for case-insensitive comparison
.*? is for optionally grabbing 0 or more characters after the opening <
/? is for making the trailing slash optional before >
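To see what this pattern does in practice, here is a small sketch that runs it over a sample string. The class name OtherTagsDemo, the helper findAll and the second pattern NO_CLOSING are assumptions added for illustration. NO_CLOSING is a hypothetical variant that moves an optional slash into the lookahead, in case closing tags such as </b> should be skipped as well; it is not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class OtherTagsDemo {
    // The pattern from the answer, unchanged.
    static final Pattern OTHER_TAGS =
            Pattern.compile("(?i)<(?![bip]\\b).*?/?>");

    // Assumed variant: "/?" inside the lookahead also rejects </b>, </i>, </p>.
    static final Pattern NO_CLOSING =
            Pattern.compile("(?i)<(?!/?[bip]\\b).*?/?>");

    // Collects every substring of html matched by the given pattern.
    static List<String> findAll(Pattern pattern, String html) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }
}
```

For the input <p>Hi <b>there</b> <img src='x.png'/> <I>ok</I></p>, the original pattern should return </b>, the img tag, </I> and </p>, because a closing tag starts with / rather than b, i or p. The variant should return only the img tag. Since . does not match line terminators by default, neither pattern matches a tag that spans several lines.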
Upvotes: 1
Reputation: 120496
Do you want to recognize or to remove tags?
And do you want to distinguish between tags like <img onerror='alert("PWNED")' src=bogus>
and ones that meet some definition of legit?
See http://code.google.com/p/owasp-java-html-sanitizer/source/browse/trunk/src/main/org/owasp/html/HtmlPolicyBuilder.java for a way to create lightweight HTML sanitization policies in Java.
Upvotes: 0