Reputation: 16764
I want to find matches of a word in an HTML string, but only where the word is not inside a tag and not between a pair of tags.
For example, the word is: ue
<span color=blue>ue</span>ue<span>sdfsd</span>
Here I want to find only the third occurrence: not the one inside "blue", and not the one between the span tags.
Thanks
Upvotes: 1
Views: 3974
Reputation: 34385
Assuming you are dealing with a fragment of HTML (and not a complete document), you can write a regular expression that matches most well-formed, innermost, non-nested elements, and then apply this regex repeatedly to remove all tagged material, leaving only the desired untagged text. Here is just such a regex (in commented PHP/PCRE 'x' syntax), which matches most empty and non-empty, non-nested, non-shorttag HTML elements.
$re_html = '%# Match non-nested, non-shorttag HTML empty and non-empty elements.
    <                        # Opening tag opening "<" delimiter.
    (\w+)\b                  # $1: Tag name.
    (?:                      # Non-capture group for optional attribute(s).
      \s+                    # Attributes must be separated by whitespace.
      [\w\-.:]+              # Attribute name is required for attr=value pair.
      (?:                    # Non-capture group for optional attribute value.
        \s*=\s*              # Name and value separated by "=" and optional ws.
        (?:                  # Non-capture group for attrib value alternatives.
          "[^"]*"            # Double quoted string.
        | \'[^\']*\'         # Single quoted string.
        | [\w\-.:]+\b        # Non-quoted attrib value can be A-Z0-9-._:
        )                    # End of attribute value alternatives.
      )?                     # Attribute value is optional.
    )*                       # Allow zero or more attribute=value pairs
    \s*                      # Whitespace is allowed before closing delimiter.
    (?:                      # This element is either empty or has close tag.
      />                     # Is either an empty tag having no contents,
    | >                      # or has both opening and closing tags.
      (                      # $2: Tag contents.
        [^<]*                # Everything up to next tag. (normal*)
        (?:                  # We found a tag (open or close).
          (?!</?\1\b) <      # Not us? Match the "<". (special)
          [^<]*              # More of everything up to next tag. (normal*)
        )*                   # Unroll-the-loop. (special normal*)*
      )                      # End $2. Tag contents.
      </\1\s*>               # Closing tag.
    )
    %x';
Here's the same regex in JavaScript syntax:
var re_html = /<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+\b))?)*\s*(?:\/>|>([^<]*(?:(?!<\/?\1\b)<[^<]*)*)<\/\1\s*>)/;
The following JavaScript function strips HTML elements, leaving the desired text between the tags:
// Strip HTML elements.
function strip_html_elements(text) {
    // Match non-nested, non-shorttag HTML empty and non-empty elements.
    var re = /<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+\b))?)*\s*(?:\/>|>([^<]*(?:(?!<\/?\1\b)<[^<]*)*)<\/\1\s*>)/g;
    // Loop removing innermost HTML elements from inside out.
    while (text.search(re) !== -1) {
        text = text.replace(re, '');
    }
    return text;
}
This regex solution is not a proper parser and handles only simple HTML fragments containing nothing but HTML elements. It does not (and cannot) correctly process more complex markup containing such things as comments, CDATA sections, and doctype statements. It does not remove elements that are missing their optional close tags (e.g. <p> and <li> elements).
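As a usage sketch, the stripping function can be combined with a plain substring count to answer the original question. The helper name count_untagged_matches is hypothetical (it is not part of the answer), and it assumes a non-empty search word; the strip function is repeated here only so the snippet runs on its own.

```javascript
// Strip HTML elements (same regex as in the function above).
function strip_html_elements(text) {
    var re = /<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+\b))?)*\s*(?:\/>|>([^<]*(?:(?!<\/?\1\b)<[^<]*)*)<\/\1\s*>)/g;
    while (text.search(re) !== -1) {
        text = text.replace(re, '');
    }
    return text;
}

// Hypothetical helper: count occurrences of `word` in the text that is
// left over once all elements have been stripped.
// Assumes `word` is a non-empty string.
function count_untagged_matches(html, word) {
    var text = strip_html_elements(html);
    return text.split(word).length - 1;
}

var html = '<span color=blue>ue</span>ue<span>sdfsd</span>';
console.log(strip_html_elements(html));          // "ue"
console.log(count_untagged_matches(html, 'ue')); // 1
```

Only the "ue" between the two span elements survives the stripping, so that is the only occurrence counted. The same caveats apply: this works for simple fragments only.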
Upvotes: 5
Reputation: 816312
Since you have excellent DOM manipulation facilities in the browser, you can make use of them. You could create a new element, set the string as its content, and iterate over its top-level text nodes:
var tmp = document.createElement('div');
tmp.innerHTML = htmlString;
var matches = [],
    children = tmp.childNodes,
    node,
    // Pad with spaces so that only whole words match.
    needle = ' ' + word + ' ';
for (var i = children.length; i--; ) {
    node = children[i];
    // nodeType 3 is a text node; only top-level text nodes are checked.
    if (node.nodeType === 3 && (' ' + node.nodeValue + ' ').indexOf(needle) > -1) {
        matches.push(node);
    }
}
Upvotes: 2
Reputation: 1074058
You're trying to use regular expressions to parse HTML. HTML cannot be readily, reliably processed with a regular expression on its own.
If you're doing this in a browser, you can instead leverage the browser's highly-optimized HTML parser.
If you want to detect the word when there's a tag in-between (e.g., "u<hr>e"):
var element, node, topLevelText;
element = document.createElement('div');
element.innerHTML = "<span color=blue>ue</span>ue<span>sdfsd</span>";
topLevelText = "";
for (node = element.firstChild; node; node = node.nextSibling) {
    if (node.nodeType === 3) { // 3 = text node
        topLevelText += node.nodeValue;
    }
}
if (topLevelText.indexOf(word) >= 0) {
    // Found
}
If you only want to detect it between things (so, your example but not "u<hr>e"):
var element, node;
element = document.createElement('div');
element.innerHTML = "<span color=blue>ue</span>ue<span>sdfsd</span>";
for (node = element.firstChild; node; node = node.nextSibling) {
    if (node.nodeType === 3) { // 3 = text node
        if (node.nodeValue.indexOf(word) >= 0) {
            // Found
        }
    }
}
(Both of those do case-sensitive matching.)
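If case-insensitive matching is wanted, one possible variation is to lower-case both sides before comparing. The sketch below is only an illustration: the function name findInTopLevelText and its ignoreCase parameter are made up for this example. It reads only firstChild, nextSibling, nodeType and nodeValue, so it accepts any DOM container element.

```javascript
// Hypothetical helper: return the top-level text nodes of `container`
// whose text contains `word`. With `ignoreCase` set, both the node text
// and the word are lower-cased before comparing.
function findInTopLevelText(container, word, ignoreCase) {
    var hits = [];
    var needle = ignoreCase ? word.toLowerCase() : word;
    var node, text;
    for (node = container.firstChild; node; node = node.nextSibling) {
        if (node.nodeType === 3) { // 3 = text node
            text = ignoreCase ? node.nodeValue.toLowerCase() : node.nodeValue;
            if (text.indexOf(needle) >= 0) {
                hits.push(node);
            }
        }
    }
    return hits;
}

// Browser-only usage; skipped where no `document` is available.
if (typeof document !== 'undefined') {
    var element = document.createElement('div');
    element.innerHTML = '<span color=blue>ue</span>UE<span>sdfsd</span>';
    console.log(findInTopLevelText(element, 'ue', true).length);  // 1
    console.log(findInTopLevelText(element, 'ue', false).length); // 0
}
```

Returning the matching nodes rather than a boolean leaves it to the caller to decide what to do with each hit.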
Both snippets work the same way: they create an element with document.createElement, parse the HTML by setting .innerHTML on the element, and then loop over the element's child nodes using Node#firstChild, Node#nodeType, Node#nodeValue, and Node#nextSibling. The innerHTML property has only recently been standardized, but it's been supported by all major browsers for a decade or so. The others are mostly defined in the DOM2 Core spec, most of which is supported by most browsers.
Upvotes: 4
Reputation: 72469
HTML is not a regular language, so it cannot be parsed by regular expressions.
Upvotes: 2