Reputation: 2053
I'm aware of all the SO answers that warn against using regex to parse HTML, but I have a scenario where parsers and DOM tricks are not possible, so I need to use regex to remove a tag and its contents when the contents match a defined text value. For example, in:
<div>foo bar</div>
<a href="http://example.com">some text</a>
<div>foo bar foo bar</div>
I'm currently using this function to parse out matching links
/**
* Removes links from html text
* @param {string} html The html to be cleaned.
* @param {string} exclude The string of link text to remove.
* @returns {string} Cleaned html.
*/
function cleanBody(html, exclude){
html = html.replace(/\r?\n|\r|\t/g, ''); // strip line breaks and tabs
var re = '<a\\b[^>]*>('+exclude+')<\\/a>';
return html.replace(new RegExp(re,'ig'),"");
}
In the example above I'd pass the html and the string 'some text' to remove the link. This works well for my scenario until the link includes other markup, e.g.
<div>foo bar</div>
<a href="http://example.com"><font color="#1122cc">some text</font></a>
<div>foo bar foo bar</div>
How can I improve the Regex (or function) to account for additional markup (without using DOM, jQuery or other libraries)?
Upvotes: 1
Views: 1420
Reputation: 4612
The following regular expression should work for the specific case you presented:
var re="<a\\b[^>]*>(<[^>]+>)*("+exclude+")(<(?!/a>)[^>]+>)*</a>";
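The first added group, (<[^>]+>)*, matches zero or more tags between the opening anchor tag and the exclude text.
The second added group, (<(?!/a>)[^>]+>)*, matches zero or more tags after the exclude text, stopping short of the closing </a>.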
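Dropped into the function from the question, the pattern could be used roughly as in the sketch below. The name cleanBody and its parameters come from the question; the sample input is the one from the question, and the sketch assumes exclude is plain text with no regex metacharacters in it.

```javascript
// Sketch: the question's cleanBody() with the extended pattern.
// Assumption: `exclude` contains no regex metacharacters.
function cleanBody(html, exclude) {
  // Strip line breaks and tabs, as the original function does.
  html = html.replace(/\r?\n|\r|\t/g, '');
  // Optional tags before the text, the text itself, optional tags
  // (anything except </a>) after it, then the closing anchor tag.
  var re = "<a\\b[^>]*>(<[^>]+>)*(" + exclude + ")(<(?!/a>)[^>]+>)*</a>";
  return html.replace(new RegExp(re, 'ig'), '');
}

var input = '<div>foo bar</div>\n' +
  '<a href="http://example.com"><font color="#1122cc">some text</font></a>\n' +
  '<div>foo bar foo bar</div>';

console.log(cleanBody(input, 'some text'));
// <div>foo bar</div><div>foo bar foo bar</div>
```

Links whose text does not match exclude are left untouched.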
Please be aware that this regular expression is still not very "smart" about how it does its job. It will not try to match balanced tags or filter out invalid tag names, so the following invalid HTML will be matched:
<a href="http://example.com">some text</font></span></div></a>
<a href="http://example.com"><div>some text</font></span></div></a>
<a href="http://example.com"><foo>some text</div></a>
In addition, note that the following invalid HTML is matched only up to the closing anchor tag:
<a href="http://example.com"><div>some text</font></a></div>
The closing </div> will not be matched.
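If you want to see this for yourself, a quick check along these lines should do. This is only a sketch: the variable names are mine, and the samples are the invalid snippets listed above.

```javascript
// Sketch: run the pattern over the invalid samples from this answer.
var exclude = 'some text';
var pattern = "<a\\b[^>]*>(<[^>]+>)*(" + exclude + ")(<(?!/a>)[^>]+>)*</a>";

// Hypothetical helper: remove every match and return what is left.
function strip(html) {
  return html.replace(new RegExp(pattern, 'ig'), '');
}

// Unbalanced or made-up tags inside the anchor are matched in full.
var unbalanced = strip('<a href="http://example.com"><foo>some text</div></a>');
console.log(JSON.stringify(unbalanced)); // ""

// A stray closing tag after </a> is outside the match and stays behind.
var leftover = strip('<a href="http://example.com"><div>some text</font></a></div>');
console.log(leftover); // </div>
```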
Be careful with nested anchors. The following will match as a whole, even though only one closing anchor tag is matched:
<a href="http://foo.org"><a href="http://example.com">some text</a>
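One possible way to stop the outer opening tag from being swallowed is to forbid anchor tags inside the leading group. The variant below is my own assumption, not something the question asked for: it swaps (<[^>]+>)* for (<(?!/?a\b)[^>]+>)*, so the match can only start at the innermost anchor.

```javascript
// Sketch: compare the pattern from this answer with a stricter variant
// whose leading group refuses <a ...> and </a> tags.
var exclude = 'some text';
var original = "<a\\b[^>]*>(<[^>]+>)*(" + exclude + ")(<(?!/a>)[^>]+>)*</a>";
var stricter = "<a\\b[^>]*>(<(?!/?a\\b)[^>]+>)*(" + exclude + ")(<(?!/a>)[^>]+>)*</a>";

var nested = '<a href="http://foo.org"><a href="http://example.com">some text</a>';

var withOriginal = nested.replace(new RegExp(original, 'ig'), '');
var withStricter = nested.replace(new RegExp(stricter, 'ig'), '');

console.log(JSON.stringify(withOriginal)); // ""
console.log(withStricter);                 // <a href="http://foo.org">
```

Whether leaving the outer opening tag behind is better than removing it depends on your data, so treat this as one option rather than a fix.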
There may be other data that unexpectedly matches this pattern that I have not thought of.
On the plus side, the nested tags don't have to wrap the exclude text. The following will be matched:
<a href="http://example.com"><span></span>some text<div></div></a>
<a href="http://example.com">some text<font></font></a>
There are a few opportunities to make the regex a little more flexible and/or safe, but that's beyond the scope of what you asked.
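For what it's worth, here is a rough sketch of what I mean. Everything in it is an assumption on my part rather than a requirement from the question: escapeRegExp and cleanBodySafer are hypothetical names, and the choices made (escaping exclude, tolerating whitespace around the text, refusing nested anchor tags) are just the tweaks I would try first.

```javascript
// Hypothetical helper: escape regex metacharacters so that `exclude`
// is always treated as literal text.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical, more defensive variant of the question's cleanBody().
function cleanBodySafer(html, exclude) {
  html = html.replace(/\r?\n|\r|\t/g, '');
  // Zero or more tags other than <a ...> or </a>, each optionally
  // followed by whitespace.
  var tags = "(?:<(?!/?a\\b)[^>]+>\\s*)*";
  var re = "<a\\b[^>]*>\\s*" + tags + escapeRegExp(exclude) + "\\s*" + tags + "</a\\s*>";
  return html.replace(new RegExp(re, 'ig'), '');
}

var cleaned = cleanBodySafer('<p>x</p><a href="#"><b> C++ (beta) </b></a>', 'C++ (beta)');
console.log(cleaned); // <p>x</p>
```

Without the escaping step, characters such as + and ( in exclude would be read as pattern syntax instead of literal text.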
Upvotes: 1