Sukasa

Reputation: 1700

Use regex to find specific string not in html tag

I'm having some difficulty with a specific Regex I'm trying to use. I'm searching for every occurrence of a string (for my purposes, I'll say it's "mystring") in a document, EXCEPT where it's in a tag, e.g.

<a href="_mystring_">

should not match, but

<a href="someotherstring">_mystring_</a>

should match, since it's not inside a tag (where "inside" means between the < and > markers). I'm using .NET's regex functions for this.

Upvotes: 27

Views: 20586

Answers (8)

Ricky Leung

Reputation: 21

_mystring_(?![^<]*?>)

Note that this requires well-formed HTML: a stray > in ordinary text makes the occurrence before it look as if it were inside a tag.
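For illustration, here is a minimal sketch of the same lookahead using Python's re module instead of .NET (negative lookahead uses the same syntax in both). The search term mystring, the replacement NEW and the helper name replace_with_lookahead are placeholders chosen for the example, not part of the answer.

```python
import re

# Match "mystring" only when the text after it does not reach a ">"
# before the next "<", i.e. when the match is not inside a tag.
OUTSIDE_TAG = re.compile(r"mystring(?![^<]*?>)")


def replace_with_lookahead(html, replacement):
    """Replace occurrences of the search term that are outside tags."""
    # A function is used so the replacement is inserted literally.
    return OUTSIDE_TAG.sub(lambda m: replacement, html)


print(replace_with_lookahead('<a href="mystring">mystring</a>', "NEW"))
# Stray ">" in plain text: the occurrence is skipped, which is the
# "valid HTML required" caveat in action.
print(replace_with_lookahead("mystring > 3", "NEW"))
```

The first call prints `<a href="mystring">NEW</a>`; the second prints the input unchanged, because the unescaped > makes the pattern treat the occurrence as tag content.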

Upvotes: 1

John Camden

Reputation: 632

When your regex processor doesn't support variable-length lookbehind, try this:

(<.+?>[^<>]*?)(_mystring_)([^<>]*?<.+?>)

Preserve capture groups 1 and 3 and replace capture group 2:

For example, in Eclipse, find:

(<.+?>[^<>]*?)(_mystring_)([^<>]*?<.+?>)

and replace with:

$1_newString_$3

(Other regex processors might use a different capture group syntax, such as \1)
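As a rough sketch of the same find-and-replace outside an editor, this is how it might look in Python's re module (not .NET). The names replace_with_groups and NEW are made up for the example, and mystring stands in for the real search term.

```python
import re

# Group 1: the preceding tag plus any text before the term.
# Group 2: the term itself.
# Group 3: any text after the term plus the following tag.
PATTERN = re.compile(r"(<.+?>[^<>]*?)(mystring)([^<>]*?<.+?>)")


def replace_with_groups(html, replacement):
    """Keep groups 1 and 3, swap group 2 for the replacement text."""
    return PATTERN.sub(lambda m: m.group(1) + replacement + m.group(3), html)


print(replace_with_groups('<a href="mystring">mystring</a>', "NEW"))
print(replace_with_groups("<p>mystring mystring</p>", "NEW"))
```

The first call prints `<a href="mystring">NEW</a>`. The second prints `<p>NEW mystring</p>`: each match consumes the surrounding tags, so a single pass replaces only one occurrence per run of text, and text with no tag on both sides is not matched at all. Repeating the substitution until nothing changes is one way around the first limitation.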

Upvotes: 14

sbonami

Reputation: 1922

Another regex that worked for me:

(?![^<]*>)_mystring_

Source: https://stackoverflow.com/a/857819/1106878

Upvotes: 15

bobs12

Reputation: 21

A quick and dirty alternative is to use a regex replace function with callback to encode the content of tags (everything between < and >), for example using base64, then run your search, then run another callback to decode your tag contents.

This can also save a lot of head scratching when you need to exclude specific tags from a regex search - first obfuscate them and wrap them in a marker that won't match your search, then run your search, then deobfuscate whatever is in markers.
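A minimal sketch of the encode, search, decode idea, written in Python rather than .NET. Everything named here (replace_protecting_tags, the NUL-byte markers, the tag pattern <[^>]*>) is an assumption made for the example; the answer itself does not prescribe any of it.

```python
import base64
import re

TAG = re.compile(r"<[^>]*>")
# Encoded tags are wrapped in NUL bytes, which ordinary text should not contain.
MARKED = re.compile("\x00([A-Za-z0-9+/=]*)\x00")


def replace_protecting_tags(html, needle, replacement):
    """Replace needle outside tags by hiding the tags during the search."""

    def encode(match):
        payload = base64.b64encode(match.group(0).encode("utf-8"))
        return "\x00" + payload.decode("ascii") + "\x00"

    def decode(match):
        return base64.b64decode(match.group(1)).decode("utf-8")

    protected = TAG.sub(encode, html)                   # 1. obfuscate the tags
    replaced = protected.replace(needle, replacement)   # 2. run the search
    return MARKED.sub(decode, replaced)                 # 3. restore the tags


print(replace_protecting_tags('<a href="mystring">mystring</a>', "mystring", "NEW"))
```

This prints `<a href="mystring">NEW</a>`. One caveat with base64 specifically: a short search term made of letters and digits could happen to appear inside an encoded payload, so a different encoding may be safer for such terms.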

Upvotes: 2

Nick Higgs

Reputation: 1702

This should do it:

(?<!<[^>]*)_mystring_

It uses a negative lookbehind to check that the matched string is not preceded by a < without a corresponding >. This relies on variable-length lookbehind, which .NET supports but many other regex engines do not.

Upvotes: 40

LBushkin

Reputation: 131676

Regular expression searches are typically not a good idea in XML. It's too easy to run into problems with search expressions matching too much or too little. It's also almost impossible to formulate a regex that can correctly identify and handle CDATA sections, processing instructions (PIs), and escape sequences that XML allows.

Unless you have complete control over the XML content you're getting and can guarantee it won't include such constructs (and won't change), I would advise using an XML parser of some kind (XDocument or XmlDocument in .NET, for instance).

Having said that, if you're still intent on using regex as your search mechanism, something like the following should work using the Regex class in .NET. You may want to test it out with some of your own test cases at a site like Regexlib. You may also be able to search their regular expression catalog to find something that might fit your needs.

[>].(_mystring_).[<]

Upvotes: -2

cdm9002

Reputation: 1960

Ignoring that there are indeed other ways, and that I'm no real regex expert, one thing that popped into my head was:

  • find all the mystrings that ARE in tags first - because I can't write the expression to do the opposite :)
  • change those to something else
  • then replace all the other mystrings (the ones left that are not in tags) as you need
  • restore the original mystrings that were in tags

So, using <[^>]*?(mystring)[^>]*> you can find the tagged ones. Replace those with otherstring. Do your normal replace on the mystrings that are left. Then replace otherstring back to mystring.

Crude but effective... maybe.
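The four steps above could be sketched like this in Python (not .NET). It deviates slightly from the pattern in the answer: it matches whole tags with <[^>]*> and swaps the term inside each one in a callback, so tags containing the term more than once are handled in one pass. The function name and the NUL-delimited placeholder are invented for the example.

```python
import re

TAG = re.compile(r"<[^>]*>")


def replace_with_placeholder(html, needle, replacement,
                             placeholder="\x00HIDDEN\x00"):
    """Hide in-tag occurrences, replace the rest, then restore the hidden ones."""
    # Steps 1 and 2: find occurrences inside tags and change them to something else.
    hidden = TAG.sub(lambda m: m.group(0).replace(needle, placeholder), html)
    # Step 3: replace the occurrences that are left (those not in tags).
    replaced = hidden.replace(needle, replacement)
    # Step 4: restore the original text inside the tags.
    return replaced.replace(placeholder, needle)


print(replace_with_placeholder(
    '<a href="mystring" title="mystring">mystring</a>', "mystring", "NEW"))
```

This prints `<a href="mystring" title="mystring">NEW</a>`. The placeholder must be something that cannot occur in the document or in the replacement text, which is why the sketch uses NUL bytes.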

Upvotes: 0

Marc Gravell

Reputation: 1062590

Why use regex?

For xhtml, load it into XDocument / XmlDocument; for (non-x)html the Html Agility Pack would seem a more sensible choice...

Either way, that will parse the html into a DOM so you can iterate over the nodes and inspect them.
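To give a feel for the parser-based approach, here is a small sketch using Python's standard html.parser module. It is only a stand-in for the .NET libraries named above, not their API; the class and function names are hypothetical. It collects the text nodes that contain the search term and never looks at tag names or attribute values.

```python
from html.parser import HTMLParser


class TextNodeFinder(HTMLParser):
    """Collects text nodes that contain the search term."""

    def __init__(self, needle):
        super().__init__(convert_charrefs=True)
        self.needle = needle
        self.hits = []

    def handle_data(self, data):
        # Called only for text between tags, never for tag or attribute content.
        if self.needle in data:
            self.hits.append(data)


def find_in_text_nodes(html, needle):
    finder = TextNodeFinder(needle)
    finder.feed(html)
    finder.close()  # flush any text still buffered at end of input
    return finder.hits


print(find_in_text_nodes('<a href="mystring">see mystring here</a>', "mystring"))
```

This prints `['see mystring here']`; the occurrence in the href attribute is ignored because attributes are never passed to handle_data.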

Upvotes: 1
