Anh Tú

Reputation: 636

Match a string that is not inside a tag

I want to match the string hello world in an HTML string like this:

Hello world! hello world! Hello world! <a href="#">hello world</a><p>hello world</p><p><a href="#">hello world</a></p>

But I don't want to match hello world when it is inside a tag. For example:

<a href="#">hello world</a>

and

<p><a href="#">hello world</a></p>

should not match.

My code:

var replacepattern = new RegExp('hello world(?![^<]*>)',"ig");

matches every occurrence of hello world in the string. Any ideas?

EDIT:

I use (?![^<]*>) to handle cases like <p title="hello world"> hello world</p>, so that I don't match the hello world inside tag attributes.
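
To see what that lookahead actually excludes, here is a minimal sketch (the sample markup, the @ marker, and the variable names are only illustrative). It suggests that (?![^<]*>) rejects only matches sitting between a tag's angle brackets, such as attribute values, while text between an opening and a closing tag still matches:

```javascript
// Illustrative check: which "hello world" occurrences survive the lookahead?
var pattern = /hello world(?![^<]*>)/gi;
var sample = '<p title="hello world">hello world</p><a href="#">hello world</a>';

// Replace every match with "@" so the matched positions are easy to spot.
var marked = sample.replace(pattern, '@');
console.log(marked);
// <p title="hello world">@</p><a href="#">@</a>
```

The attribute value is left alone, but the text inside the a element is still matched, which is the behaviour described above.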

EDIT 2:

I want to return the string:

'<a href="#hello world">Hello world</a>! <a href="#hello world">Hello world</a>! <a href="#hello world">Hello world</a>! <a href="#">Hello world</a><p><a href="#hello world">Hello world</a></p><p><a href="#">Hello world</a></p>'

Upvotes: 0

Views: 723

Answers (4)

sparanoid

Reputation: 1558

Most browsers now support negative lookahead, so you can try this:

(?![^>]*<\/[a-zA-Z]>)(Hello world)

Demo: https://regex101.com/r/rDPp0t/2/
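
As a quick check (a sketch only; the gi flags and the @ marker are assumptions added for illustration, not part of the pattern above), running it over the question's sample string suggests that it skips any occurrence immediately followed by a one-letter closing tag such as </a> or </p>, so the text inside the p element is skipped as well:

```javascript
// The question's sample string.
var str = 'Hello world! hello world! Hello world! <a href="#">hello world</a><p>hello world</p><p><a href="#">hello world</a></p>';

// Pattern from above; "g" and "i" are added here so every occurrence is tested.
var re = /(?![^>]*<\/[a-zA-Z]>)(Hello world)/gi;

var out = str.replace(re, '@');
console.log(out);
// @! @! @! <a href="#">hello world</a><p>hello world</p><p><a href="#">hello world</a></p>
```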

Upvotes: 0

Ro Yo Mi

Reputation: 15000

Description

This expression will:

  • allow you to replace only the hello world substrings which are outside the anchor tags
  • handle the edge cases that make pattern matching in HTML difficult
  • not use atomic groups, as they are not supported in JavaScript

Regex

((?:<a(?=\s|>)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>.*?<\/a>|(?!hello\sworld|<a\s).)*)(hello\sworld\s\d+)((?:<a(?=\s|>)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>.*?<\/a>|(?!hello\sworld|<a\s).)*)

Full Explanation

Theory:

  • ((?:<a(?=\s|>)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>.*?<\/a>|(?!hello\sworld|<a\s).)*) Captures the anchor tags, and any text outside the anchor tags which is not hello world. This is group 1
  • (hello\sworld\s\d+) Captures the hello world. This is group 2. Since I added digits in my sample text to help show which substrings were being captured, I also added the \s\d+ to this section. Yes, arguably this is beyond your original scope. :)
  • ((?:<a(?=\s|>)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>.*?<\/a>|(?!hello\sworld|<a\s).)*) Captures the anchor tags, and any text outside the anchor tags which is not hello world. This is group 3. It's an identical pattern to group 1, but is required or else you might encounter odd results on the last match in the string.
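
The underlying idea (consume anchors whole and only touch the phrase outside them) can also be sketched with a plain alternation and a replace callback. This is a simplified variant, not the expression above: the function name linkOutsideAnchors is hypothetical, the href format is taken from the question's EDIT 2, and it assumes anchors are not nested and attribute values contain no > character, so it does not cover the onmouseover edge case.

```javascript
// Simplified sketch: match whole anchors, other tags, OR the phrase;
// only the phrase (capture group 1) is rewritten, everything else is kept.
function linkOutsideAnchors(html, phrase) {
  // Escape regex metacharacters so the phrase is matched literally.
  var escaped = phrase.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  var re = new RegExp(
    '<a(?=\\s|>)[^>]*>[\\s\\S]*?<\\/a>' + // a complete anchor element
    '|<[^>]*>' +                          // any other tag, attributes included
    '|(' + escaped + ')',                 // the phrase in plain text
    'gi'
  );
  return html.replace(re, function (match, hit) {
    // hit is undefined when an anchor or tag matched: leave it untouched.
    return hit === undefined ? match : '<a href="#' + hit + '">' + hit + '</a>';
  });
}

var input = 'Hello world! hello world! Hello world! <a href="#">hello world</a><p>hello world</p><p><a href="#">hello world</a></p>';
var linked = linkOutsideAnchors(input, 'hello world');
console.log(linked);
```

On the question's sample this wraps the three leading occurrences and the one inside the first p element, and leaves the two existing anchors as they are.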

Replace With

In the samples below I used this replacement to help make it more obvious what's happening:

$1_______$3

To wrap your hello world strings in anchor tags, you could use this replacement instead:

$1<a href="$2">$2</a>$3

Examples

Sample text

Note the difficult edge cases in the anchor tag with the onmouseover attribute. I also added numbers to each of the hello worlds so they are easier for us humans to read.

<a href="#">hello world 00</a>Hello world 1! hello world 2! Hello world 3! <a onmouseover=' a=1; href="www.NotYourURL.com" ; if (3 <a && href="www.NotYourURL.com" && id="revSAR" && 6 > 3) { funRotate(href) ; } ; ' href="#">hello world 04</a><p>hello world 5</p><p><a href="#">hello world 06</a></p> <a href="#">hello world 07</a>fdafdsa

Sample Javascript

<script type="text/javascript">
  // "g" replaces every occurrence; "i" also matches "Hello world", as in the results shown below
  var re = /((?:<a(?=\s|>)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>.*?<\/a>|(?!hello\sworld|<a\s).)*)(hello\sworld\s\d+)((?:<a(?=\s|>)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>.*?<\/a>|(?!hello\sworld|<a\s).)*)/gi;
  var sourcestring = "source string to match with pattern";
  var replacementpattern = '$1<a href="$2">$2</a>$3';
  var result = sourcestring.replace(re, replacementpattern);
  alert("result = " + result);
</script>

String After Replacement

This is just to show what's happening, using the first "replace with"

<a href="#">hello world 00</a>_______! _______! _______! <a href="#">hello world 04</a><p>_______</p><p><a href="#">hello world 06</a></p> <a href="#">hello world 07</a>fdafdsa

This is using the second "replace with" to show that it actually works

<a href="#">hello world 00</a><a href="Hello world 1">Hello world 1</a>! <a href="hello world 2">hello world 2</a>! <a href="Hello world 3">Hello world 3</a>! <a onmouseover=' a=1; href="www.NotYourURL.com" ; if (3 <a && href="www.NotYourURL.com" && id="revSAR" && 6 > 3) { funRotate(href) ; } ; ' href="#">hello world 04</a><p><a href="hello world 5">hello world 5</a></p><p><a href="#">hello world 06</a></p> <a href="#">hello world 07</a>fdafdsa

Upvotes: 1

Krasimir

Reputation: 13529

I think that this will work:

var str = 'Hello > world <! Hello > world <! Hello > world <! <a href="#">Hello > world <</a><p>Hello > world <</p><p><a href="#">Hello > world <</a></p>';
var textToReplace = 'Hello > world <'
var re = new RegExp('(?!(^<*(href=)*(>)))' + textToReplace + '(?!(</a>))',"ig");
var result = str.replace(re, '@');
console.log(result);

The result is

@! @! @! <a href="#">Hello > world <</a><p>@</p><p><a href="#">Hello > world <</a></p> 

Is that what you want to achieve?

JsFiddle -> http://jsfiddle.net/Che3v/1/

Upvotes: -1

Benjamin Gruenbaum

Reputation: 276306

Let's say you have that HTML in a string:

var str = 'Hello world! hello world! Hello world! <a href="#">hello world</a><p>hello world</p><p><a href="#">hello world</a></p>';

Instead of coming up with complicated regex patterns to match it, we'll put that HTML in an HTML container and use the powerful DOM API that every browser exposes to JavaScript to process it.

var el = document.createElement("div");
el.innerHTML = str;

Now, let's get all the a tags from our element and remove them ourselves:

var aTags = el.getElementsByTagName("a");
while(aTags.length > 0){ // while the element still has a tags 
    aTags[0].parentNode.removeChild(aTags[0]); //remove
}

Now we can get the HTML back, with the a tags gone:

el.innerHTML; 

This now is:

"Hello world! hello world! Hello world! <p>hello world</p><p></p>"

Now, if we just want the text without the tags, we can do that too.

el.textContent;

Will evaluate to:

"Hello world! hello world! Hello world! hello world"
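
If the goal is the linked output from the question's EDIT 2 rather than the stripped text, the same DOM approach can be pushed a bit further. The following is a browser-only sketch under my own assumptions (the function name linkifyOutsideAnchors is hypothetical, and the href format follows EDIT 2): it walks the text nodes and wraps each occurrence that is not already inside an a element.

```javascript
// Browser-only sketch; relies on the global `document`, like the code above.
function linkifyOutsideAnchors(html, phrase) {
  var container = document.createElement('div');
  container.innerHTML = html;

  // Collect the text nodes first, so the tree is not mutated while walking it.
  var walker = document.createTreeWalker(container, 4); // 4 = NodeFilter.SHOW_TEXT
  var textNodes = [];
  while (walker.nextNode()) textNodes.push(walker.currentNode);

  // Assumes lower-casing does not change the string length (true for ASCII).
  var lower = phrase.toLowerCase();
  textNodes.forEach(function (node) {
    if (node.parentNode.closest('a')) return; // text already inside an anchor
    var rest = node;
    var idx;
    while ((idx = rest.data.toLowerCase().indexOf(lower)) !== -1) {
      var hit = rest.splitText(idx);       // hit now starts at the phrase
      rest = hit.splitText(phrase.length); // rest is whatever follows it
      var link = document.createElement('a');
      link.setAttribute('href', '#' + hit.data);
      hit.parentNode.replaceChild(link, hit);
      link.appendChild(hit);               // move the matched text into the link
    }
  });
  return container.innerHTML;
}
```

In a browser, calling linkifyOutsideAnchors(str, 'hello world') should leave the existing a elements untouched, because only text nodes outside anchors are ever split.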

Upvotes: 1
