mhawksey

Reputation: 2053

Remove href link from html that matches text ignoring inner link tags using regex

I'm aware of all the SO answers that warn against using regex to parse HTML, but I have a scenario where parsers and DOM tricks are not possible, and I need to use regex to remove a tag and its contents when the contents match a defined text value. For example, in:

<div>foo bar</div>
<a href="http://example.com">some text</a>
<div>foo bar foo bar</div>

I'm currently using this function to strip out matching links:

/**
 * Removes links from html text
 * @param {string} html The html to be cleaned.
 * @param {string} exclude The string of link text to remove.
 * @returns {string} Cleaned html.
 */
function cleanBody(html, exclude){
  // Strip line breaks and tabs so the link pattern can match across lines.
  html = html.replace(/\r?\n|\r|\t/g, '');
  var re = '<a\\b[^>]*>('+exclude+')<\\/a>';
  return html.replace(new RegExp(re,'ig'),"");
}

In the example above I'd pass the html and the string 'some text' to remove the link. This works well for my scenario until the link includes other markup, e.g.:

<div>foo bar</div>
<a href="http://example.com"><font color="#1122cc">some text</font></a>
<div>foo bar foo bar</div>

How can I improve the Regex (or function) to account for additional markup (without using DOM, jQuery or other libraries)?

Upvotes: 1

Views: 1420

Answers (1)

RobertB

Reputation: 4612

The following regular expression should work for the specific case you presented:

var re="<a\\b[^>]*>(<[^>]+>)*("+exclude+")(<(?!/a>)[^>]+>)*</a>";
  • after the match for the opening anchor tag, add a pattern that matches zero or more tags, whether they are opening tags or closing tags, valid or invalid: (<[^>]+>)*
  • after the match for the exclude text, add a pattern that matches zero or more tags, whether they are opening tags or closing tags, valid or invalid, but--using a negative lookahead--do not match the closing anchor tag: (<(?!/a>)[^>]+>)*
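Dropped into the function from the question, the pattern might look like the sketch below. The name cleanBodyNested is only illustrative, and the sketch assumes that exclude contains no regex metacharacters.

```javascript
/**
 * Removes links whose text matches `exclude`, tolerating inner tags.
 * Hypothetical name; assumes `exclude` has no regex metacharacters.
 */
function cleanBodyNested(html, exclude) {
  // Strip line breaks and tabs, as the original function does.
  html = html.replace(/\r?\n|\r|\t/g, '');
  // Opening anchor, any tags, the text, any tags except </a>, then </a>.
  var re = "<a\\b[^>]*>(<[^>]+>)*(" + exclude + ")(<(?!/a>)[^>]+>)*</a>";
  return html.replace(new RegExp(re, 'ig'), '');
}

var sample = '<div>foo bar</div>\n' +
  '<a href="http://example.com"><font color="#1122cc">some text</font></a>\n' +
  '<div>foo bar foo bar</div>';
console.log(cleanBodyNested(sample, 'some text'));
// prints: <div>foo bar</div><div>foo bar foo bar</div>
```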

Please realize that this regular expression is still not very "smart" in the way it does its job. It will not try to match balanced tags or filter out invalid tag names, so the following invalid HTML will be matched:

<a href="http://example.com">some text</font></span></div></a>
<a href="http://example.com"><div>some text</font></span></div></a>
<a href="http://example.com"><foo>some text</div></a>

In addition, note that the following invalid HTML is matched only up to the closing anchor tag:

<a href="http://example.com"><div>some text</font></a></div>

The closing </div> will not be matched.
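A quick way to see this is to run the replacement on that input and look at what is left over. This is only a small check, with 'some text' assumed as the exclude value:

```javascript
var exclude = 'some text'; // assumed exclude value for this check
var re = new RegExp(
  "<a\\b[^>]*>(<[^>]+>)*(" + exclude + ")(<(?!/a>)[^>]+>)*</a>", 'ig');

var input = '<a href="http://example.com"><div>some text</font></a></div>';
var leftover = input.replace(re, '');

// The match stops at </a>, so the stray closing tag survives.
console.log(leftover); // prints: </div>
```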

Be careful about nested anchors. The following will match (noting that only one closing anchor tag is matched):

<a href="http://foo.org"><a href="http://example.com">some text</a>
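If nested anchors are a concern, one possible tightening (my own variation, not something the question asked for) is to stop the first tag group from swallowing an opening anchor by adding a negative lookahead there as well. The sketch below compares the two patterns. The function name is illustrative, and exclude is again assumed to contain no regex metacharacters.

```javascript
var nested = '<a href="http://foo.org"><a href="http://example.com">some text</a>';

// Pattern from the answer: the first group can consume the inner <a ...>,
// so the match starts at the outer opening anchor.
var loose = new RegExp(
  "<a\\b[^>]*>(<[^>]+>)*(some text)(<(?!/a>)[^>]+>)*</a>", 'ig');

// Hypothetical variant: (?!a\b) stops the first group from matching
// another opening anchor tag.
function cleanBodyNoNestedAnchors(html, exclude) {
  var re = "<a\\b[^>]*>(<(?!a\\b)[^>]+>)*(" + exclude +
           ")(<(?!/a>)[^>]+>)*</a>";
  return html.replace(new RegExp(re, 'ig'), '');
}

console.log(nested.replace(loose, ''));
// prints an empty string: both opening anchors were removed

console.log(cleanBodyNoNestedAnchors(nested, 'some text'));
// prints: <a href="http://foo.org">
```

Whether leaving the outer opening tag behind is preferable depends on the data; either way the input was invalid HTML to begin with.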

There may be other data that unexpectedly matches this pattern that I have not thought of.

On the plus side, the nested tags don't have to wrap the exclude text. The following will be matched:

<a href="http://example.com"><span></span>some text<div></div></a>
<a href="http://example.com">some text<font></font></a>

There are a few opportunities to make the regex a little more flexible and/or safe, but that's beyond the scope of what you asked.
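As one example of the "safe" side: because exclude is concatenated straight into the pattern, any regex metacharacters in the link text (parentheses, dots, a dollar sign and so on) would change the meaning of the expression. A minimal sketch of escaping it first is below. Both escapeRegExp and cleanBodyEscaped are hypothetical helper names, not part of the original code.

```javascript
// Hypothetical helper: backslash-escape regex metacharacters so the
// link text is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical variant of cleanBody that escapes `exclude` first.
function cleanBodyEscaped(html, exclude) {
  html = html.replace(/\r?\n|\r|\t/g, '');
  var re = "<a\\b[^>]*>(<[^>]+>)*(" + escapeRegExp(exclude) +
           ")(<(?!/a>)[^>]+>)*</a>";
  return html.replace(new RegExp(re, 'ig'), '');
}

var html = '<a href="#"><b>price $5 (approx.)</b></a><p>x</p>';
console.log(cleanBodyEscaped(html, 'price $5 (approx.)'));
// prints: <p>x</p>
```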

Upvotes: 1
