Reputation: 1659

How do you match all quotes NOT contained within HTML tags?

In the following string...

var str = 'Foobar is so "awesome."  I <span prop="nifty">"really"</span> <span prop="attr">think it is so</span> <span prop="nifty" prop="attr">"cool!"</span>'

...how would I write a regular expression that matches the quotes (") around the words awesome, really, and cool, while NOT matching the quotes within the HTML tags?

I'm using JavaScript's replace function to replace the quotes with &#34;.

I'm hoping there's a regular expression that I can use such that...

str.replace(/regex-magic/g, "&#34;")

...gives me the output...

Foobar is so `&#34;`awesome.`&#34;`  I <span prop="nifty">`&#34;`really`&#34;`</span> <span prop="attr">think it is so</span> <span prop="nifty" prop="attr">`&#34;`cool!`&#34;`</span>

Thanks much!!

Upvotes: 0

Answers (4)

Alan Moore

Reputation: 75222

str = str.replace(/"(?![^<>]*>)/g, "&#34;");

(?![^<>]*>) is a negative lookahead. It scans forward from the current position (in this case, just after a quote has been matched) looking for a closing angle bracket (>). If it finds one without seeing an opening bracket (<) first, the quote must be inside an HTML tag, so the match fails.

var str = 'Foobar is so "awesome."  I <span prop="nifty">"really"</span> <span prop="attr">think it is so</span> <span prop="nifty" prop="attr">"cool!"</span>';
str = str.replace(/"(?![^<>]*>)/g, "&#34;");
alert(str);

As the other responders said, it's best to process HTML as HTML whenever possible (and it usually is possible). When you process it character by character like this, it's much too easy to introduce errors, even if you're an expert with whatever tool you're using.
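
To make that risk concrete, here is a minimal sketch. The helper name escapeQuotesOutsideTags and both test strings are illustrative and not from the question. It shows the pattern working on tag-delimited input, then tripping over a bare > in ordinary text:

```javascript
// Hypothetical helper wrapping the regex from this answer.
function escapeQuotesOutsideTags(html) {
  // A quote is skipped when a '>' follows it with no '<' or '>' in between.
  return html.replace(/"(?![^<>]*>)/g, "&#34;");
}

// Works on input shaped like the question's string:
// attribute quotes are kept, text quotes are escaped.
console.log(escapeQuotesOutsideTags('I <span prop="nifty">"really"</span>'));

// Goes wrong when plain text contains a bare '>': the first quote
// looks as if it were inside a tag, so it is left unescaped.
console.log(escapeQuotesOutsideTags('He said "1 > 0" loudly'));
```

In the second call only the closing quote is escaped, which is exactly the kind of silent error a real HTML parser avoids.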

Upvotes: 1

Rick Hitchcock

Reputation: 35670

For problems like this, I find it easier to process individual text nodes rather than struggle through regex syntax.

Assuming your string is not within a DOM element, you can easily create an element, and simply not attach it to the DOM.

My function below iterates through the child nodes. If the child is a text node, it changes " to &#34;. Otherwise, it calls itself recursively with the child. The output is then stored in a textarea:

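function replaceQuotes(d) {
  var cn = d.childNodes;
  for (var i = 0; i < cn.length; i++) {
    if (cn[i].nodeType === 3) {
      // Text node (nodeType 3): replace the quotes in its text.
      cn[i].nodeValue = cn[i].nodeValue.replace(/"/g, '&#34;');
    }
    else {
      // Any other node: recurse into its children.
      replaceQuotes(cn[i]);
    }
  }
}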

var str = 'Foobar is so "awesome."  I <span prop="nifty">"really"</span> <span prop="attr">think it is so</span> <span prop="nifty" prop="attr">"cool!"</span>'

var d= document.createElement('div');
d.innerHTML= str;
replaceQuotes(d);

document.querySelector('textarea').innerHTML= d.innerHTML;

textarea {
  width: 80%;
  height: 100px;
}

<textarea></textarea>
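
Where no document object is available (outside a browser, for example), a rough approximation of the same idea is to split the string into tags and text and touch only the text pieces. This is a sketch, not a real parser: replaceQuotesInText is a hypothetical name, and the code assumes attribute values contain no > and that the markup has no comments or script blocks.

```javascript
// Hypothetical DOM-free helper: escape quotes only in the text
// between tags. Assumes no '>' inside attribute values and no
// comments or <script> content.
function replaceQuotesInText(html) {
  return html
    // The capture group makes split() keep the tags in the result array.
    .split(/(<[^>]*>)/)
    .map(function (part) {
      // Pieces that are whole tags pass through unchanged;
      // everything else is text and gets its quotes replaced.
      return /^<[^>]*>$/.test(part) ? part : part.replace(/"/g, '&#34;');
    })
    .join('');
}

var str = 'Foobar is so "awesome."  I <span prop="nifty">"really"</span> <span prop="attr">think it is so</span> <span prop="nifty" prop="attr">"cool!"</span>';
console.log(replaceQuotesInText(str));
```

Unlike the DOM version, this never parses the markup, so it is only as reliable as those assumptions.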

Upvotes: 0

willeM_ Van Onsem

Reputation: 476584

As always, it is a very bad idea to do HTML/XML processing using regular expressions.

Anyway, I guess one can use the following regular expression:

([^<]*<[^>]*>[^<]*)*?\"(.*?)\"

The first group is used to ensure that every opened tag is closed as well. The second group ensures you match anything between the quotes.

If, however, you want to do it properly, you can use tidy to convert it to an XML file and then use, for instance, xmllint to perform XPath queries. I'm sure JavaScript has such tools as well.

Example (in bash):

$ echo 'Foobar is so "awesome."  I <span prop="nifty">"really"</span> <span prop="attr">think it is so</span> <span prop="nifty" prop="attr">"cool!"</span>' | tidy -asxhtml -numeric 2>/dev/null | xmllint --html --xpath 'normalize-space(/)' - | grep -P -o '".*?"'
"awesome."
"really"
"cool!"

Upvotes: 5

Brad

Reputation: 163234

What I would do is use a DOM parser to read the whole document, and then output the whole document as valid HTML. Then you don't even have to mess with it, and you'll be getting the best possible interpretation of your invalid, ambiguous HTML.

Upvotes: 2
