Free Lancer

Reputation: 1000

How to make a Regex Pattern for HTML Simple Text?

I am trying to learn regex patterns for a class, so I am writing a simple HTML lexer/parser. I know regex is not the best or most efficient way to build a lexer/parser, but the goal is only to understand regex patterns.

So my question is: how do I create a pattern that checks that a String contains no HTML tags (e.g. <TAG>) and no HTML entities (e.g. &ENT;)?

This is what I have come up with so far, but it does not work:

.+?(^(?:&[A-Za-z0-9#]+;)^(?:<.*?>))

EDIT: The only problem is that I can't negate the final outcome. I need a complete pattern that accomplishes this on its own, if that is possible, even if it is not pretty. I did not mention it earlier, but the pattern is supposed to match any simple text in an HTML page.

Upvotes: 2

Views: 1950

Answers (2)

Platinum Azure

Reputation: 46203

If you're looking to match strings that do NOT follow a pattern, the simplest thing to do is to match the pattern and then negate the result of the test.

<[^>]+>|&[^;]+;

Any string that matches this pattern will have AT LEAST ONE tag (as you've defined it) or entity (as you've defined it). So the strings you want are strings that DO NOT match this pattern (they will have NO tags or entities).
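As a minimal sketch of that "match, then negate" idea in Java: the class name PlainTextCheck and the helper isPlainText are hypothetical names chosen for illustration, and the pattern is the one given above.

```java
import java.util.regex.Pattern;

public class PlainTextCheck {
    // Pattern from the answer: a tag such as <b>, or an entity such as &nbsp;
    private static final Pattern MARKUP = Pattern.compile("<[^>]+>|&[^;]+;");

    // Hypothetical helper: true when no tag or entity occurs anywhere in s.
    public static boolean isPlainText(String s) {
        // find() looks for a match anywhere in the string; negating it
        // gives "contains no markup".
        return !MARKUP.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isPlainText("hello"));               // true
        System.out.println(isPlainText("hello <b>world</b>!")); // false
        System.out.println(isPlainText("Hello&nbsp;world"));    // false
    }
}
```

Note that a lone < or & with no closing > or ; (for example "a < b") does not count as markup under this pattern, so such strings are still reported as plain text.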

Upvotes: 1

aioobe

Reputation: 421150

You could use the expression <.+?>|&.+?; to search for a match, and then negate the result.

  • <.+?> matches a <, then one or more of any character (as few as possible, because +? is reluctant), then a >
  • &.+?; matches a &, then one or more of any character (as few as possible), then a ;

Here is a complete example:

import java.util.regex.*;

public class Test {
    public static void main(String[] args) {
        String[] tests = { "hello", "hello <b>world</b>!", "Hello&nbsp;world" };
        // Matches either a tag such as <b> or an entity such as &nbsp;
        Pattern p = Pattern.compile("<.+?>|&.+?;");
        for (String test : tests) {
            Matcher m = p.matcher(test);
            // find() scans for the first match anywhere in the string
            if (m.find())
                System.out.printf("\"%s\" has HTML: %s%n", test, m.group());
            else
                System.out.printf("\"%s\" has no HTML%n", test);
        }
    }
}

Output:

"hello" has no HTML
"hello <b>world</b>!" has HTML: <b>
"Hello&nbsp;world" has HTML: &nbsp;
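If the result really cannot be negated in code, one possible workaround is to move the negation into the pattern itself with a negative lookahead. The following is only a sketch under that assumption: the class name SinglePatternCheck and the helper isPlainText are hypothetical, and the pattern wraps the same <.+?>|&.+?; expression used above.

```java
import java.util.regex.Pattern;

public class SinglePatternCheck {
    // (?s) lets . match line terminators too.
    // (?!.*(?:<.+?>|&.+?;)) fails at the start if a tag or entity appears
    // anywhere later in the input; the trailing .* then consumes the text.
    private static final Pattern PLAIN =
            Pattern.compile("(?s)(?!.*(?:<.+?>|&.+?;)).*");

    // Hypothetical helper: true when the whole string is plain text.
    public static boolean isPlainText(String s) {
        // matches() requires the pattern to cover the entire input.
        return PLAIN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPlainText("hello"));               // true
        System.out.println(isPlainText("hello <b>world</b>!")); // false
        System.out.println(isPlainText("Hello&nbsp;world"));    // false
    }
}
```

Here the match itself succeeds only for strings without tags or entities, so no negation of the result is needed.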

Upvotes: 2
