Remove href link from html that matches text ignoring inner link tags using regex

Question

I'm aware of all the SO answers that warn against using regex to parse HTML, but I have a scenario where parsers and DOM tricks are not possible, so I need to use regex to remove a link tag, together with its contents, when its text matches a defined value. For example, in:

foo barsome text
foo bar foo bar

I'm currently using this function to strip out matching links:

/**
 * Removes links from html text
 * @param {string} html The html to be cleaned.
 * @param {string} exclude The string of link text to remove.
 * @returns {string} Cleaned html.
 */
function cleanBody(html, exclude){
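  // Strip line breaks and tabs so that a link split across lines still matches.
  html = html.replace(/\r?\n|\r|\t/g, '');
  // Match an anchor whose entire content is the exclude text.
  var re = '<a[^>]*>(' + exclude + ')<\/a>';
  return html.replace(new RegExp(re, 'ig'), '');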
}

In the example above I'd pass the html and the string 'some text' to remove the link. This works well for my scenario until the link includes other markup, e.g.:

foo bar
some text
foo bar foo bar

How can I improve the Regex (or function) to account for additional markup (without using DOM, jQuery or other libraries)?

RobertB · Accepted Answer

The following regular expression should work for the specific case you presented:

var re = "<a[^>]*>(<[^>]+>)*(" + exclude + ")(<(?!/a>)[^>]+>)*</a>";

After the match for the opening anchor tag, add a pattern that matches zero or more tags, whether opening or closing, valid or invalid: (<[^>]+>)*
After the match for the exclude text, add a pattern that matches zero or more tags of any kind but, by means of a negative lookahead, never the closing anchor tag: (<(?!/a>)[^>]+>)*
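Put together, the two additions could be dropped into the question's function roughly like this. It is only a sketch: the leading `<a[^>]*>` and the trailing `</a>` are my assumption about the full pattern, `buildLinkPattern` is a made-up helper name, and the input is made up too.

```javascript
// Sketch of the pattern described above. The '<a[^>]*>' prefix and the
// '</a>' suffix are assumptions; buildLinkPattern is a hypothetical helper.
function buildLinkPattern(exclude) {
  return new RegExp(
    '<a[^>]*>' +            // opening anchor tag
    '(<[^>]+>)*' +          // zero or more tags before the text
    '(' + exclude + ')' +   // the link text to remove
    '(<(?!/a>)[^>]+>)*' +   // zero or more tags, but never the closing anchor
    '</a>',                 // closing anchor tag
    'ig'
  );
}

function cleanBody(html, exclude) {
  html = html.replace(/\r?\n|\r|\t/g, '');
  return html.replace(buildLinkPattern(exclude), '');
}

// Made-up input with markup inside the link.
var html = '<p>foo bar<a href="#"><b>some text</b></a></p><p>foo bar foo bar</p>';
console.log(cleanBody(html, 'some text'));
// <p>foo bar</p><p>foo bar foo bar</p>
```

Note that `exclude` is still inserted into the pattern unescaped, exactly as in the original function.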

Please realize that this regular expression is still not very "smart" in the way it does its job. It will not try to match balanced tags or filter out invalid tag names, so the following invalid HTML will be matched:

some text
some text
some text
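To see this for yourself, here is a quick check with a few made-up invalid fragments. The pattern literal assumes the full form of the expression (opening `<a[^>]*>` and closing `</a>`) with 'some text' as the exclude string.

```javascript
// Assumed full form of the pattern, with 'some text' as the exclude string.
var pattern = /<a[^>]*>(<[^>]+>)*(some text)(<(?!\/a>)[^>]+>)*<\/a>/i;

// Hypothetical invalid fragments; each one is still matched in full.
var samples = [
  '<a href="#"><b>some text</a>',         // <b> is never closed
  '<a href="#"><b>some text</i></a>',     // closing tag does not pair with <b>
  '<a href="#"><nosuchtag>some text</a>'  // tag name is not valid HTML
];

samples.forEach(function (s) {
  // Removing the match leaves an empty string, so the whole fragment matched.
  console.log(s.replace(pattern, '') === ''); // true for every sample
});
```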

In addition, note that the following invalid HTML is matched only up to the closing anchor tag:

some text

The closing tag that follows the closing anchor tag will not be matched.
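A small illustration with made-up input, again assuming the full form of the pattern with 'some text' as the exclude string:

```javascript
var pattern = /<a[^>]*>(<[^>]+>)*(some text)(<(?!\/a>)[^>]+>)*<\/a>/ig;

// Hypothetical fragment in which </b> is misplaced after </a>.
var html = '<p><a href="#"><b>some text</a></b></p>';
var result = html.replace(pattern, '');

console.log(result); // <p></b></p>
```

The stray closing tag is left behind in the output.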

Be careful about nested anchors. The following will match (note that only one closing anchor tag is matched):

some text
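For instance, with a made-up nested anchor and the same assumed pattern, the inner opening tag is consumed as "just another tag" and the match ends at the first closing anchor:

```javascript
var pattern = /<a[^>]*>(<[^>]+>)*(some text)(<(?!\/a>)[^>]+>)*<\/a>/ig;

// Hypothetical (invalid) nested anchors.
var html = '<p><a href="#outer"><a href="#inner">some text</a></a></p>';
var result = html.replace(pattern, '');

console.log(result); // <p></a></p>
```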

There may be other data that unexpectedly matches this pattern that I have not thought of.

On the plus side, the nested tags don't have to wrap the exclude text. The following will be matched:

some text
some text
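A quick check with made-up fragments in which the inner tags sit next to the text rather than around it (same assumed pattern as above):

```javascript
var pattern = /<a[^>]*>(<[^>]+>)*(some text)(<(?!\/a>)[^>]+>)*<\/a>/i;

// Hypothetical fragments; none of the inner tags wrap the text.
var samples = [
  '<a href="#"><br>some text</a>',            // tag only before the text
  '<a href="#">some text<br></a>',            // tag only after the text
  '<a href="#"><b></b>some text<i></i></a>'   // empty pairs on both sides
];

samples.forEach(function (s) {
  console.log(s.replace(pattern, '') === ''); // true for every sample
});
```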

There are a few opportunities to make the regex a little more flexible and/or safe, but that's beyond the scope of what you asked.
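As one possible direction, and purely as a sketch of my own guesses at "more flexible and/or safe": escape regex metacharacters in the exclude text, use a word boundary so tags such as `<abbr>` are not treated as anchors, and tolerate whitespace around the text. The names `escapeRegExp` and `cleanBodySafer` are made up for this example.

```javascript
// Escape regex metacharacters so the exclude text is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical variant of cleanBody; the extra tolerance is an assumption
// about what "more flexible" might mean, not part of the pattern above.
function cleanBodySafer(html, exclude) {
  var re = new RegExp(
    '<a\\b[^>]*>' +                     // \b keeps <abbr> etc. from matching
    '(?:<[^>]+>|\\s)*' +                // tags or whitespace before the text
    escapeRegExp(exclude) +             // literal link text
    '(?:<(?!\\/a\\s*>)[^>]+>|\\s)*' +   // tags or whitespace, but not </a>
    '<\\/a\\s*>',                       // closing anchor, optional space
    'ig'
  );
  return html.replace(re, '');
}

console.log(cleanBodySafer('<p><a href="#"> <b>a+b (c)</b> </a>x</p>', 'a+b (c)'));
// <p>x</p>
```

Without the escaping step, an exclude string like 'a+b (c)' would be interpreted as a pattern and would not match the literal text.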
