Shahab

Reputation: 822

Regular expression to replace matched keywords outside HTML tags and outside anchor (a) tag text

I am developing an ASP.NET application and want to add a keyword linking system.

I want to turn the keyword into a hyperlink to another page. However, the keyword should not be linked if it is already linked (to any page). For example:

it is a <a href="http://www.somesite.com">linked keyword</a> and it should be a linked keyword.

should convert to:

it is a <a href="http://www.somesite.com">linked keyword</a> and it should be a linked <a href="http://newlycreatedLink.com">keyword</a>.

As you can see, the first keyword should be left intact.

Could you please help me solve this problem?

I've found this link in the ASP.NET forums, but I need to adapt that answer to exclude keywords that are already linked. I've searched everywhere but found nothing.

Upvotes: 3

Views: 2224

Answers (2)

Jonny 5

Reputation: 12389

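To check whether the keyword is "outside", use a lookahead:

  • (?= ... ) looks ahead: after the keyword there must be an opening <tag or the end of the input ($)
  • [^<>]* matches any number of characters that are neither < nor >
  • this is followed by (?:<\w|$), where \w is shorthand for a word character (roughly [a-zA-Z_0-9]; in .NET it also matches Unicode letters and digits by default)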

So the pattern could look like:

String pattern = @"(?i)\bkeyword\b(?=[^<>]*(?:<\w|$))";

String replacement = @"<a href=""http://newlycreatedLink.com"">$0</a>";

The keyword is wrapped in word boundaries (\b), and the (?i) modifier makes the match case-insensitive.

So this only replaces a keyword that is followed, after any text containing no < or >, by an opening tag or by the end of the input.
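
As a rough sketch of how the pattern above might be applied from C#, using the sample input from the question: the class and method names (KeywordLinkDemo, LinkKeyword) are made up for illustration. Note that in a .NET replacement string the whole match is written as $0, and that quotes inside a verbatim @"..." string are escaped by doubling them.

```csharp
using System;
using System.Text.RegularExpressions;

static class KeywordLinkDemo
{
    // Hypothetical helper; the pattern is the one described in this answer.
    static string LinkKeyword(string html)
    {
        // Match "keyword" only if the next tag after it is an opening tag,
        // or if there is no tag at all before the end of the input.
        string pattern = @"(?i)\bkeyword\b(?=[^<>]*(?:<\w|$))";

        // Verbatim string: inner quotes are doubled; $0 inserts the whole match.
        string replacement = @"<a href=""http://newlycreatedLink.com"">$0</a>";

        return Regex.Replace(html, pattern, replacement);
    }

    static void Main()
    {
        string input = @"it is a <a href=""http://www.somesite.com"">linked keyword</a> and it should be a linked keyword.";

        // The keyword inside the existing <a> is followed by "</a", so the
        // lookahead fails there; the trailing keyword is followed only by "."
        // and the end of the input, so it is wrapped in a new link.
        Console.WriteLine(LinkKeyword(input));
    }
}
```

Because $0 re-inserts the matched text, the original casing of the keyword is preserved even though the match is case-insensitive.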


UPDATE: To also replace the keyword "inside" elements whose closing tag is not </a, add the alternative |<\/[^a]:

String pattern = @"(?i)\bkeyword\b(?=[^<>]*(?:<\w|<\/[^a]|$))";
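
A small sketch of what the updated pattern does, under the assumption that the keyword sits directly inside an element such as <b>. The sample markup and the class name are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class UpdatedPatternDemo
{
    static void Main()
    {
        // Updated pattern: also accept a closing tag, as long as its name
        // does not start with "a" (so "</a" is still rejected).
        string pattern = @"(?i)\bkeyword\b(?=[^<>]*(?:<\w|<\/[^a]|$))";
        string replacement = @"<a href=""http://newlycreatedLink.com"">$0</a>";

        // Made-up sample: one keyword inside <b>, one inside an existing link.
        string input = @"<p>a <b>keyword</b> and <a href=""x.html"">linked keyword</a></p>";

        // The first keyword is followed by "</b" and gets linked;
        // the second is followed by "</a" and is left alone.
        Console.WriteLine(Regex.Replace(input, pattern, replacement));
    }
}
```

Two caveats follow from how the lookahead works. [^a] rejects every closing tag whose name starts with "a" (for example </abbr>), not just </a. And the pattern only inspects the next tag, so a keyword nested deeper inside a link, as in <a href="..."><b>keyword</b></a>, is followed by </b and would still be linked.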

Upvotes: 2

Marius Schulz

Reputation: 16440

Don't use regular expressions for sophisticated HTML parsing like this. Use a proper HTML parser instead — here's why.
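
To give a rough idea of the tree-based approach: the .NET base class library has no HTML parser, so this sketch swaps in System.Xml.Linq, which is an XML parser. It therefore only works under the assumption that the fragment is well-formed XML (no unclosed tags, no HTML-only entities such as &nbsp;); real-world HTML would need a dedicated HTML parser library. ParserSketch, LinkKeyword and its parameters are names invented for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class ParserSketch
{
    // Hypothetical helper: links the keyword in every text node
    // that has no <a> element among its ancestors.
    public static string LinkKeyword(string fragment, string keyword, string url)
    {
        // Wrap in a dummy root so a fragment with several top-level nodes parses.
        XElement root = XElement.Parse("<root>" + fragment + "</root>",
                                       LoadOptions.PreserveWhitespace);
        var word = new Regex(@"\b" + Regex.Escape(keyword) + @"\b",
                             RegexOptions.IgnoreCase);

        // Snapshot the text nodes first, because the tree is modified below.
        List<XText> textNodes = root.DescendantNodes().OfType<XText>()
            .Where(t => !t.Ancestors().Any(e =>
                e.Name.LocalName.Equals("a", StringComparison.OrdinalIgnoreCase)))
            .ToList();

        foreach (XText text in textNodes)
        {
            var pieces = new List<object>();
            int last = 0;
            foreach (Match m in word.Matches(text.Value))
            {
                // Text before the match, then the match wrapped in a new link.
                pieces.Add(new XText(text.Value.Substring(last, m.Index - last)));
                pieces.Add(new XElement("a", new XAttribute("href", url), m.Value));
                last = m.Index + m.Length;
            }
            if (pieces.Count == 0) continue;   // no keyword in this text node

            pieces.Add(new XText(text.Value.Substring(last)));
            text.ReplaceWith(pieces.ToArray());
        }

        // Serialize the children only, dropping the dummy root again.
        return string.Concat(
            root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }

    static void Main()
    {
        string input = @"it is a <a href=""http://www.somesite.com"">linked keyword</a> and it should be a linked keyword.";
        Console.WriteLine(LinkKeyword(input, "keyword", "http://newlycreatedLink.com"));
    }
}
```

Because the check looks at all ancestors of a text node, a keyword nested anywhere inside an existing link (for example inside a <b> within an <a>) is skipped, and attribute values are never touched since they are not text nodes.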

Upvotes: 1
