Itzik984

Reputation: 16764

Regex - how to find a word not enclosed by html tags or between them

I want to find a match in an HTML string that is neither inside a tag (for example, in an attribute value) nor between an opening tag and its closing tag.

For example:

The word is: ue

<span color=blue>ue</span>ue<span>sdfsd</span>

So I want to find only the third match: not the one inside "blue", and not the one between the span tags.

Thanks

Upvotes: 1

Views: 3974

Answers (4)

ridgerunner

Reputation: 34385

Assuming you are dealing with a fragment of HTML (and not a complete document), you can write a regular expression that matches most well-formed, innermost, non-nested elements, and then apply it repeatedly to remove all tagged material, leaving only the text that sits between the elements. Here is such a regex (in commented PHP/PCRE 'x' syntax), which matches most empty and non-empty, non-nested, non-shorttag HTML elements.

$re_html = '%# Match non-nested, non-shorttag HTML empty and non-empty elements.
    <                    # Opening tag opening "<" delimiter.
    (\w+)\b              # $1: Tag name.
    (?:                  # Non-capture group for optional attribute(s).
      \s+                # Attributes must be separated by whitespace.
      [\w\-.:]+          # Attribute name is required for attr=value pair.
      (?:                # Non-capture group for optional attribute value.
        \s*=\s*          # Name and value separated by "=" and optional ws.
        (?:              # Non-capture group for attrib value alternatives.
          "[^"]*"        # Double quoted string.
        | \'[^\']*\'     # Single quoted string.
        | [\w\-.:]+\b    # Non-quoted attrib value can be A-Z0-9-._:
        )                # End of attribute value alternatives.
      )?                 # Attribute value is optional.
    )*                   # Allow zero or more attribute=value pairs
    \s*                  # Whitespace is allowed before closing delimiter.
    (?:                  # This element is either empty or has close tag.
      />                 # Is either an empty tag having no contents,
    | >                  # or has both opening and closing tags.
      (                  # $2: Tag contents.
        [^<]*            # Everything up to next tag. (normal*)
        (?:              # We found a tag (open or close).
          (?!</?\1\b) <  # Not us? Match the "<". (special)
          [^<]*          # More of everything up to next tag. (normal*)
        )*               # Unroll-the-loop. (special normal*)*
      )                  # End $2. Tag contents.
      </\1\s*>           # Closing tag.
    )
    %x';

Here's the same regex in JavaScript syntax:

var re_html = /<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+\b))?)*\s*(?:\/>|>([^<]*(?:(?!<\/?\1\b)<[^<]*)*)<\/\1\s*>)/;

The following JavaScript function strips HTML elements, leaving only the text that was outside them:

// Strip HTML elements.
function strip_html_elements(text) {
    // Match non-nested, non-shorttag HTML empty and non-empty elements.
    var re = /<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+\b))?)*\s*(?:\/>|>([^<]*(?:(?!<\/?\1\b)<[^<]*)*)<\/\1\s*>)/g;
    // Loop removing innermost HTML elements from inside out.
    while (text.search(re) !== -1) {
        text = text.replace(re, '');
    }
    return text;
}

This regex solution is not a proper parser and handles only simple HTML fragments containing nothing but HTML elements. It does not (and cannot) correctly process more complex markup containing comments, CDATA sections, or doctype declarations. It also does not remove elements that omit their optional close tags (e.g. <p> and <li> elements).
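As a quick check, the function can be run against the string from the question. This is only a usage sketch: the function is repeated so the snippet runs on its own, and the variable names `input`, `remaining`, and `found` are illustrative.

```javascript
// Same function as above, repeated so this snippet is self-contained.
function strip_html_elements(text) {
    var re = /<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+\b))?)*\s*(?:\/>|>([^<]*(?:(?!<\/?\1\b)<[^<]*)*)<\/\1\s*>)/g;
    // Keep removing elements until no more can be matched.
    while (text.search(re) !== -1) {
        text = text.replace(re, '');
    }
    return text;
}

var input = '<span color=blue>ue</span>ue<span>sdfsd</span>';

// Both <span> elements (tags and contents) are removed; only the bare text is left.
var remaining = strip_html_elements(input);

// Search the leftover text for the word.
var found = remaining.indexOf('ue') !== -1;
```

Note that the stripped string no longer tells you where the word was in the original markup, so this approach fits a yes/no check better than locating the match.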

Upvotes: 5

Felix Kling

Reputation: 816312

Since the browser gives you excellent DOM manipulation facilities, you can make use of them. You could create a new element, set the string as its content, and iterate over its top-level text nodes:

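var tmp = document.createElement('div');
tmp.innerHTML = htmlString;

var matches = [],
    children = tmp.childNodes,
    node,
    // Pad with spaces so that only whole, space-delimited words match.
    // (A separate name avoids re-declaring `word` in terms of itself.)
    needle = ' ' + word + ' ';

for (var i = children.length; i--; ) {
    node = children[i];
    // nodeType 3 = text node
    if (node.nodeType === 3 && (' ' + node.nodeValue + ' ').indexOf(needle) > -1) {
        matches.push(node);
    }
}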
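The same idea can be wrapped in a function. The sketch below is not part of the original answer: `findTextNodesWithWord` and `fakeRoot` are hypothetical names, and `fakeRoot` is a hand-built stand-in for the parsed `div` so that the snippet can also be tried outside a browser. In a browser you would pass the detached element instead.

```javascript
// Hypothetical helper: collect the top-level text nodes of `root` that
// contain `word` as a whole, space-delimited token.
// `root` is any object exposing `childNodes`, e.g. a detached <div>.
function findTextNodesWithWord(root, word) {
    var TEXT_NODE = 3;
    var needle = ' ' + word + ' ';
    var matches = [];
    var children = root.childNodes;
    for (var i = 0; i < children.length; i++) {
        var node = children[i];
        if (node.nodeType === TEXT_NODE &&
                (' ' + node.nodeValue + ' ').indexOf(needle) > -1) {
            matches.push(node);
        }
    }
    return matches;
}

// Assumption: this mimics the child list a browser would build for
// "<span color=blue>ue</span>ue<span>sdfsd</span>".
var fakeRoot = {
    childNodes: [
        { nodeType: 1, nodeValue: null },  // <span color=blue>ue</span>
        { nodeType: 3, nodeValue: 'ue' },  // the bare text "ue"
        { nodeType: 1, nodeValue: null }   // <span>sdfsd</span>
    ]
};

var found = findTextNodesWithWord(fakeRoot, 'ue');
```

Because of the space padding, a text node such as "blue sky" would not count as containing "ue"; drop the padding if substring matches are wanted.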

Upvotes: 2

T.J. Crowder

Reputation: 1074058

You're trying to use regular expressions to parse HTML. HTML cannot be readily, reliably processed with a regular expression on its own.

If you're doing this in a browser, you can instead leverage the browser's highly-optimized HTML parser.

If you want to detect the word when there's a tag in-between (e.g., "u<hr>e"):

var element, node, topLevelText;
var word = "ue"; // the word to look for (from the question)
element = document.createElement('div');
element.innerHTML = "<span color=blue>ue</span>ue<span>sdfsd</span>";
topLevelText = "";
for (node = element.firstChild; node; node = node.nextSibling) {
    if (node.nodeType === 3) { // 3 = text node
        topLevelText += node.nodeValue;
    }
}
if (topLevelText.indexOf(word) >= 0) {
    // Found
}

If you only want to detect it between things (so, your example but not "u<hr>e"):

var element, node;
var word = "ue"; // the word to look for (from the question)
element = document.createElement('div');
element.innerHTML = "<span color=blue>ue</span>ue<span>sdfsd</span>";
for (node = element.firstChild; node; node = node.nextSibling) {
    if (node.nodeType === 3) { // 3 = text node
        if (node.nodeValue.indexOf(word) >= 0) {
            // Found
        }
    }
}

(Both of those do case-sensitive matching.)

Here is what that does:

  1. Creates an element that isn't displayed anywhere using document.createElement.
  2. Parses the HTML text by assigning it to innerHTML on the element. This property has only recently been standardized, but it's been supported by all major browsers for a decade or so.
  3. Looks through the immediate children of the node, which will include any elements created by parsing, and text nodes for the top-level text in the string (e.g., text in the place where you want to search for it). This is using Node#firstChild, Node#nodeType, Node#nodeValue, and Node#nextSibling.
  4. Depending on whether you want to find it in the "u<hr>e" situation, it either looks directly at the text in each of the text nodes, or it builds them all up into a string and searches that afterward.
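The two variants described in the steps above can be folded into one function. This is a sketch under stated assumptions, not code from the answer: `topLevelTextContains`, `chain`, and the `acrossTags` flag are hypothetical names, and the node chain is built by hand (instead of via `innerHTML`) so the snippet can run without a browser.

```javascript
// Hypothetical helper: does the top-level text under `root` contain `word`?
// acrossTags = true  -> join all top-level text first, so "u<hr>e" matches "ue".
// acrossTags = false -> the word must sit entirely inside one text node.
function topLevelTextContains(root, word, acrossTags) {
    var TEXT_NODE = 3;
    var joined = '';
    for (var node = root.firstChild; node; node = node.nextSibling) {
        if (node.nodeType !== TEXT_NODE) {
            continue; // skip elements and anything else that is not text
        }
        if (acrossTags) {
            joined += node.nodeValue;
        } else if (node.nodeValue.indexOf(word) >= 0) {
            return true;
        }
    }
    return acrossTags ? joined.indexOf(word) >= 0 : false;
}

// Assumption: links plain objects into a firstChild/nextSibling chain,
// standing in for the children a browser would create when parsing.
function chain(nodes) {
    for (var i = 0; i < nodes.length; i++) {
        nodes[i].nextSibling = nodes[i + 1] || null;
    }
    return { firstChild: nodes[0] || null };
}

// Stand-in for the parsed form of "u<hr>e".
var splitWord = chain([
    { nodeType: 3, nodeValue: 'u' },
    { nodeType: 1, nodeValue: null }, // <hr>
    { nodeType: 3, nodeValue: 'e' }
]);

var joinedHit = topLevelTextContains(splitWord, 'ue', true);
var perNodeHit = topLevelTextContains(splitWord, 'ue', false);
```

In a browser, `root` would simply be the detached `div` after assigning `innerHTML`.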

The properties named above are mostly defined in the DOM2 Core spec, most of which is supported by most browsers.

Upvotes: 4

Yakov Galka

Reputation: 72469

HTML is not a regular language, so it cannot be parsed by regular expressions.
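To illustrate the point: telling whether text is "outside" every element requires counting how deeply tags are nested, which is exactly what a plain regular expression cannot track. Below is a minimal hand-written scanner that keeps such a counter. It is only a sketch and not a real HTML parser: the name `findTopLevelMatches` is hypothetical, the list of void elements is deliberately partial, and it assumes well-formed markup with no comments, no CDATA, and no ">" inside attribute values.

```javascript
// Hypothetical helper: return the indices (in `html`) of every occurrence
// of `word` that lies in text at nesting depth 0.
function findTopLevelMatches(html, word) {
    // Assumption: partial list of elements that never take a closing tag.
    var VOID = { br: true, hr: true, img: true, input: true };
    var matches = [];
    if (word === '') {
        return matches; // nothing sensible to search for
    }
    var depth = 0;
    var i = 0;
    while (i < html.length) {
        if (html.charAt(i) === '<') {
            var end = html.indexOf('>', i);
            if (end === -1) {
                break; // unterminated tag: give up on the rest
            }
            var tag = html.slice(i + 1, end); // e.g. "span color=blue", "/span", "br/"
            var name = tag.replace(/^\//, '').split(/[\s\/]/)[0].toLowerCase();
            if (tag.charAt(0) === '/') {
                depth = Math.max(0, depth - 1); // closing tag
            } else if (tag.charAt(tag.length - 1) !== '/' && !VOID[name]) {
                depth += 1; // opening tag that expects a closing tag
            }
            i = end + 1;
        } else {
            var next = html.indexOf('<', i);
            if (next === -1) {
                next = html.length;
            }
            if (depth === 0) {
                // Only text outside every element is searched.
                var pos = html.indexOf(word, i);
                while (pos !== -1 && pos + word.length <= next) {
                    matches.push(pos);
                    pos = html.indexOf(word, pos + 1);
                }
            }
            i = next;
        }
    }
    return matches;
}

var positions = findTopLevelMatches('<span color=blue>ue</span>ue<span>sdfsd</span>', 'ue');
// positions -> [26], the bare "ue" between the two span elements
```

In a browser, the DOM-based approaches are the more robust choice; a scanner like this is mainly useful to show why a depth counter is needed.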

Upvotes: 2
