Taranfx

Reputation: 10437

Identify html tags present using Regex

I'm doing this on Android and I don't want to use an HTML parser library, since the sole purpose is to find out which HTML tags are present other than <b>, <i>, and <p>.

I'm using:

Pattern p = Pattern.compile("<^bip/>");

This works well, returning all tags other than <b>, <i>, and <p>, but it also skips the <img> tag. Can someone modify it so that <img> is not ignored?

Upvotes: 1

Views: 530

Answers (3)

Bruce

Reputation: 7132

If you want to find which tags are in your document, I would advise a multi-step approach:

  • extract all tags and put them in a list; the regexp is fairly simple: <(.*?)>
  • deduplicate the list, then filter it to remove the tags you don't care about (i, b, p, ...)

Done this way, the logic can be encapsulated in a class, is more configurable if you ever want to filter other tags, and is easier to understand and maintain in the long term than a cryptic regex.
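The two steps above could be sketched roughly as follows. This is only an illustration: the class name TagLister and the method unexpectedTags are made up for the example, and the pattern is a variation on <(.*?)> that captures just the tag name, so attributes and closing slashes don't end up in the list.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from any library.
class TagLister {
    // Matches opening and closing tags and captures only the tag name.
    // Comments and doctypes start with "<!" and are therefore not matched.
    private static final Pattern TAG =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^>]*>");

    private final Set<String> ignored = new HashSet<>();

    TagLister(String... ignoredTags) {
        for (String tag : ignoredTags) {
            ignored.add(tag.toLowerCase(Locale.ROOT));
        }
    }

    // Step 1: collect every tag name (the set removes duplicates).
    // Step 2: drop the names we were told to ignore.
    Set<String> unexpectedTags(String html) {
        Set<String> found = new TreeSet<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            found.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        found.removeAll(ignored);
        return found;
    }
}
```

With new TagLister("b", "i", "p"), calling unexpectedTags("<p>Hi <b>there</b> <IMG src='x.png'><br/></p>") yields the set [br, img]. Changing the ignored tags is then a constructor argument rather than a regex edit.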

My 2c

Upvotes: 1

anubhava

Reputation: 784998

I think your regex definition should be like this:

Pattern p = Pattern.compile("(?i)<(?![bip]\\b).*?/?>");
  • (?!...) is a negative lookahead, i.e. < must not be followed by b, i, or p plus a word boundary
  • (?i) makes the comparison case-insensitive
  • .*? lazily grabs zero or more characters after the opening <
  • /? makes the trailing slash before > optional
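To see what the pattern picks up, here is a small sketch that runs it over a sample string. The class name LookaheadDemo, the helper findAll, and the second pattern are assumptions added for illustration. One thing worth noticing is that closing tags such as </b> are also matched, because the character right after < is then a slash rather than b, i, or p; the second pattern is one possible tweak that puts an optional slash inside the lookahead to skip those too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the first pattern comes from the answer.
class LookaheadDemo {
    // The pattern from the answer, unchanged.
    static final Pattern OTHER_TAGS =
            Pattern.compile("(?i)<(?![bip]\\b).*?/?>");

    // Assumed variant: /? inside the lookahead also rejects </b>, </i>, </p>.
    static final Pattern OTHER_TAGS_NO_CLOSERS =
            Pattern.compile("(?i)<(?!/?[bip]\\b).*?/?>");

    // Returns every substring of html matched by the given pattern, in order.
    static List<String> findAll(Pattern pattern, String html) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }
}
```

For the input <b>bold</b> <img src='a.png'> <br/> <P>para</P> <pre>x</pre>, the first pattern returns </b>, <img src='a.png'>, <br/>, </P>, <pre>, </pre>, while the variant returns only <img src='a.png'>, <br/>, <pre>, </pre>. The word boundary is what keeps <img>, <br/> and <pre> from being mistaken for <i>, <b> and <p>.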

Upvotes: 1

Mike Samuel

Reputation: 120496

Do you want to recognize tags or to remove them?

And do you want to distinguish between tags like <img onerror='alert("PWNED")' src=bogus> and ones that meet some definition of legit?

See http://code.google.com/p/owasp-java-html-sanitizer/source/browse/trunk/src/main/org/owasp/html/HtmlPolicyBuilder.java for a way to create lightweight HTML sanitization policies in Java.

Upvotes: 0
