Reputation: 7839

JavaScript RegEx to match punctuation NOT part of any HTML tags

Okay, I know there's much controversy about matching and parsing HTML with a RegEx, but I was wondering if I could have some help. Case in point:

I need to match any punctuation characters, e.g. . , " ' but I don't want to ruin any HTML, so ideally the match should occur between a > and a <. Essentially, my query isn't so much about parsing HTML as avoiding it.

I'm going to attempt to wrap each instance in a <span></span>, but having absolutely no experience with RegEx, I'm not sure I'm able to do it.

I've figured out character sets such as [\.\,\'\"\?\!], but I'm not sure how to match character sets that only occur between certain characters. Can anybody help?

Upvotes: 3

Answers (3)

zx81

Reputation: 41838

Dan, resurrecting this question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)

The DOM parser solution was great. With all the disclaimers about using regex to parse HTML, I'd like to add a simple way to do what you wanted with regex in JavaScript.

The regex is very simple:

<[^>]*>|([.,"'])

The left side of the alternation matches complete tags. We will ignore these matches. The right side matches and captures punctuation to Group 1, and we know they are the right punctuation because they were not matched by the expression on the left.

On this demo, looking at the lower right pane, you can see that only the right punctuation is captured to Group 1.

You said you wanted to embed the punctuation in a <span>. This Javascript code will do it. I've replaced the <tags> with {tags} to make sure the example displays in the browser.

<script>
var subject = 'true ,she said. {tag \" . ,}';
var regex = /{[^}]*}|([.,"'])/g;
var replaced = subject.replace(regex, function(m, group1) {
    // Group 1 is undefined (or "" in some older engines) when a {tag} matched.
    if (!group1) return m;
    else return "&lt;span&gt;" + group1 + "&lt;/span&gt;";
});
document.write(replaced);
</script>
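
For comparison, here is a minimal sketch of the same alternation trick applied to real angle-bracket tags and emitting real <span> elements. The function name wrapPunctuation is hypothetical, and the sketch assumes that no attribute value contains a literal > character, since the tag branch stops at the first > it sees.

```javascript
// Hypothetical helper: wraps punctuation that lies outside tags in <span>.
// Assumption: attribute values never contain a literal ">".
function wrapPunctuation(html) {
  // Left branch consumes whole tags; right branch captures punctuation in group 1.
  var regex = /<[^>]*>|([.,"'])/g;
  return html.replace(regex, function (match, punct) {
    // punct is undefined when the tag branch matched, so the tag is returned as-is.
    return punct ? '<span>' + punct + '</span>' : match;
  });
}

console.log(wrapPunctuation('<a href="x.html">Hi, there.</a>'));
// prints: <a href="x.html">Hi<span>,</span> there<span>.</span></a>
```

The quotes and the dot inside the href attribute are left alone because the whole opening tag is swallowed by the left branch before the right branch gets a chance to see them.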

Here's a live demo


Upvotes: 0

Elias Van Ootegem

Reputation: 76395

To start off, here's a X-browser dom-parser function:

var parseXML = (function(w,undefined)
{
    'use strict';
    var parser,ie = false;
    switch (true)
    {
        case w.DOMParser !== undefined:
            parser = new w.DOMParser();
        break;
        case w.ActiveXObject !== undefined:
            parser = new w.ActiveXObject("Microsoft.XMLDOM");
            parser.async = false;
            ie = true;
        break;
        default :
            throw new Error('No parser found');
    }
    return function(xmlString)
    {
        if (ie === true)
        {//return DOM
            parser.loadXML(xmlString);
            return parser;
        }
        return parser.parseFromString(xmlString,'text/xml');
    };
})(this);
//usage:    
var newDom = parseXML(yourString);
var allTags = newDom.getElementsByTagName('*');
for(var i=0;i<allTags.length;i++)
{
    if (allTags[i].tagName.toLowerCase() === 'span')
    {//if all you want to work with are the spans:
        if (allTags[i].getElementsByTagName('*').length > 0)
        {
            //this span has child elements, don't apply regex:
            //(hasChildNodes() would also be true for plain text)
            continue;
        }
        allTags[i].innerHTML = allTags[i].innerHTML.replace(/[.,?!'"]+/g,'');
    }
}

This should help you on your way. You still have access to the DOM, so whenever you find a string that needs filtering/replacing, you can reference the node using allTags[i] and replace the contents.
Note that looping through all elements isn't to be recommended, but I didn't really feel like doing all of the work for you ;-). You'll have to check what kind of node you're handling:

if (allTags[i].tagName.toLowerCase() === 'span')
{//do certain things
}
if (allTags[i].tagName.toLowerCase() === 'html')
{//skip
    continue;
}

And that sort of stuff...
Note that this code is not tested, but it's a simplified version of my answer to a previous question. The parser bit should work just fine; in fact, here's a fiddle I've set up for that other question, which also shows how you might want to alter this code to better suit your needs.
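
As one possible way to do the node-type check without listing tag names, the sketch below walks the tree and touches text nodes only. The helper names collectTextNodes and stripPunctuation are hypothetical; the code relies solely on the standard DOM properties nodeType, childNodes and nodeValue, and it is a sketch rather than a tested drop-in.

```javascript
// Hypothetical helper: gathers every text node under `node`, depth first.
function collectTextNodes(node, found) {
  found = found || [];
  if (node.nodeType === 3) {            // 3 is Node.TEXT_NODE
    found.push(node);
  } else if (node.childNodes) {
    for (var i = 0; i < node.childNodes.length; i++) {
      collectTextNodes(node.childNodes[i], found);
    }
  }
  return found;
}

// Hypothetical helper: strips punctuation from text nodes only,
// so tag names and attribute values are never touched.
function stripPunctuation(root) {
  var textNodes = collectTextNodes(root);
  for (var i = 0; i < textNodes.length; i++) {
    textNodes[i].nodeValue = textNodes[i].nodeValue.replace(/[.,?!'"]+/g, '');
  }
  return root;
}
```

In a browser this could be called as stripPunctuation(parseXML(yourString).documentElement). Because only text nodes are modified, markup never passes through the regex at all.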

Upvotes: 2

Nick

Reputation: 4362

Edit: As Elias pointed out, JavaScript did not support lookbehinds when this was written (lookaheads are fine), so this pattern fails in engines without that feature. I'll leave this up in case someone else looks for something similar; just be aware.

Here is the regex I got to work, it requires lookaheads and lookbehinds and I'm not familiar enough with Javascript to know if those are supported or not. Either way, here is the regex:

(?<=>.*?)[,."'](?=.*<)

Breakdown:

1. (?<=>.*?)  -->  The match(es) must be preceded by ">" followed by any characters
2. [,."']     -->  Matches any one of the characters:  ,  .  "  '
3. (?=.*<)    -->  The match(es) must be followed by any characters and then "<"

This essentially means it will match any of the characters you want in between a set of > <.

That being said, I would suggest as Point mentioned in the comments to parse the HTML with a tool designed for that, and search through the results with the regex [,."'].
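
For engines that implement ES2018 lookbehind assertions (an assumption about your runtime; older engines reject the pattern with a syntax error), a tightened variant can be sketched as follows. It swaps .* for [^<>]* so that no other angle bracket may sit between the punctuation and the nearest > and <, which keeps the match out of tag interiors. The function name wrapBetweenTags is hypothetical.

```javascript
// Hypothetical helper; requires an engine with ES2018 lookbehind support.
function wrapBetweenTags(html) {
  // (?<=>[^<>]*)  a ">" earlier, with no other angle bracket in between
  // (?=[^<>]*<)   a "<" later, with no other angle bracket in between
  var regex = /(?<=>[^<>]*)[,."'](?=[^<>]*<)/g;
  return html.replace(regex, '<span>$&</span>');
}

console.log(wrapBetweenTags('<p class="a.b">Hi, "there".</p>'));
// prints: <p class="a.b">Hi<span>,</span> <span>"</span>there<span>"</span><span>.</span></p>
```

Text that is not enclosed by tags on both sides (for example, at the very start of the string) is not matched by this variant, which fits the "between a > and a <" requirement in the question.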

Upvotes: 1
