Reputation: 89

JavaScript Regex Exclude + Include pattern match

I am using JavaScript RegExp for search highlighting on HTML content.

To do that I am using:

data.replace( new RegExp("("+search+")", 'g'), "<b id='searchHighlight'>$1</b>" );

where data is the whole of the HTML content and search is the search string.

When I search for, e.g., h, it highlights h in words (the, there, etc.) as well as inside tags such as "<h1 id="title"> Something </h1>".

I can't go for an alternative approach since I need to highlight the same HTML content with the same style.

I have read solutions like:

var input = "a dog <span class='something'> had a  </span> and a cat";
// Remove anything tag-like
var temp = input.replace(/<.+?>/g, "");
// Perform the search
var matches = new RegExp(exp, "g").exec(temp);

But since I need to highlight the search text in the same HTML content, I can't simply strip out the existing tags. Is there any way to do an include and exclude search in RegExp, so that I could, for example, highlight h in "the" as "t<b id='searchHighlight'>h</b>e"
without corrupting "<h1 id="title">Test</h1>" into "<<b id='searchHighlight'>h</b>1 id="title">Test</<b id='searchHighlight'>h</b>1>"?

The HTML content is static and looks like this:

    <h1 id="title">Samples</h1>
        <div id="content">
            <div  class="principle">
        <h2 id="heading">           
            PRINCIPLE</h2>


        <p>
            FDA recognizes that samples are an important part of ensuring that the right drugs are provided to the right patients. Under the Prescription Drug Marketing Act (PDMA), a sales representative is permitted to provide prescription drug samples to eligible healthcare professionals (HCPs). In order for BMS to provide this service, representatives must strictly abide by all applicable compliance standards pertaining to the distribution of samples.</p></div>
<h2 id="heading">           
            WHY DOES IT MATTER?</h2>
        <p>
            The Office of Inspector General (OIG) recognizes that samples can have monetary value to HCPs and, when used improperly, may have implications under the Federal False Claims Act and the Federal Anti-kickback Act. To minimize risk of such liability, the OIG requires the clear and conspicuous labeling of individual samples as units that cannot be sold.&nbsp; BMS and its business partners label every sample package to meet this requirement.&nbsp; Additionally, the HCP signature statement acknowledges that the samples will not be sold, billed or provided to family members or friends.</p>
        <h2 id="heading">

            WHO IS YOUR SMaRT PARTNER?</h2>
        <p>
            SMaRT is an acronym for &ldquo;Samples Management and Representatives Together&rdquo;.&nbsp; A SMaRT Partner has a thorough understanding of BMS sample requirements and is available to assist the field with any day-to-day policy or procedure questions related to sample activity. A SMaRT Partner will also:</p>

        <ul>
            <li style="margin-left:22pt;"> Monitor your adherence to BMS&rsquo;s sample requirements.</li>
            <li style="margin-left:22pt;"> Act as a conduit for sharing sample compliance issues and best practices.</li>
            <li style="margin-left:22pt;"> Respond to day-to-day sample accountability questions within two business days of receipt.</li>
        </ul>
        <p>

            Your SMaRT Partner can be reached at 888-475-2328, Option 3.</p>
        <h2 id="heading">

            BMS SAMPLE ACCOUNTABILITY POLICIES &amp; PROCEDURES</h2>
        <p>
            It is the responsibility of each sales representative to read, understand and follow the BMS Field Sample Accountability Procedures, USPSM-SOP-101. The basic expectations are:</p>
        <ul>
            <li style="margin-left:22pt;"> Transmit all sample activity by communicating your tablet to the host server on a <strong>daily</strong> basis.</li>
            <li style="margin-left:22pt;"> Maintain a four to six week inventory of samples rather than excessive, larger inventories that are more difficult to manage and increase your risk of non-compliance.</li>
            <li style="margin-left:22pt;"> Witness all HCP&rsquo;s signatures to confirm request and receipt of samples.</li>
        </ul>
</div>

The contents are all scattered and not in just one tag. So DOM manipulation is not a solution for me.

Upvotes: 3

Answers (3)

guypursey

Reputation: 3194

This isn't a pure RegExp solution but, if you can't traverse the DOM, then string manipulation with functional replaces and loops like this could work for you.

1. Declare the variables you need and fetch the innerHTML of your document body.
2. Look through the data extracting any tags and saving them in an array for now. Leave a placeholder so you know where to put them back later.
3. With all the tags replaced with temporary placeholders in your string, you can then replace the characters you need to, using your original code but assigning the result back to data.
4. Then you would need to restore the tags by reversing the earlier process.
5. Assign the new data as the innerHTML of your document body.

This is the process in action.

Here is the code:

var data = document.body.innerHTML, // get the DOM as a string
    tagarray = [], // a place to temporarily store all your tags
    tagmatch = /<[^>]+>/g, // for matching tags
    tagplaceholder = '<>', // could be anything but should not match the RegExp above, and not be the same as the search string below
    search = 'h'; // for example; but this could be set dynamically

// the global flag already replaces every tag in one pass, so no loop is needed
data = data.replace(tagmatch, function (str) {
    tagarray.push(str); // store each matched tag in your array
    return tagplaceholder; // whatever your placeholder should be
});

data = data.replace( new RegExp("("+search+")", 'g'), "<b id='searchHighlight'>$1</b>" ); // now search and replace the string of your choice

// restore the tags in one pass; shift() takes no arguments and returns the oldest saved tag
data = data.replace(new RegExp(tagplaceholder, 'g'), function () {
    return tagarray.shift(); // replace the placeholders with the tags you saved earlier to restore them
});

document.body.innerHTML = data; // assign the changed `data` string to the body

Obviously if you can put this all in a function of its own, so much the better, as you don't really want global variables like the above hanging around.
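As a rough sketch of what that function might look like: the name highlightOutsideTags, the NUL placeholder, and the use of class instead of id are all assumptions made for illustration, not part of the code above. It works on a plain string, so it can be tried without a document, and it escapes the search term so that characters such as . or + are matched literally.

```javascript
// Hypothetical helper: highlight `search` in an HTML string, leaving tags untouched.
function highlightOutsideTags(html, search) {
    if (!search) {
        return html; // nothing to highlight
    }
    var tags = [];
    // Assumption: a NUL character appears in neither the markup nor the search term.
    var placeholder = '\u0000';
    // Escape regex metacharacters so the search term is matched literally.
    var escaped = search.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

    // 1. Swap every tag for the placeholder, remembering the tags in order.
    var stripped = html.replace(/<[^>]+>/g, function (tag) {
        tags.push(tag);
        return placeholder;
    });

    // 2. Highlight matches in the tag-free text.
    var marked = stripped.replace(new RegExp('(' + escaped + ')', 'g'),
        "<b class='searchHighlight'>$1</b>");

    // 3. Put the saved tags back in their original order.
    return marked.replace(/\u0000/g, function () {
        return tags.shift();
    });
}
```

It could then be used as document.body.innerHTML = highlightOutsideTags(document.body.innerHTML, 'h'). Like the code above, it still treats entities such as &nbsp; as ordinary text.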

Upvotes: 0

MikeM

Reputation: 13641

If you can be sure there are no < or > in a tag's attributes, you could just use

data = data.replace( 
    new RegExp( "(" + search + "(?![^<>]*>))", 'g' ),
        "<b id='searchHighlight'>$1</b>" );

The negative look-ahead (?![^<>]*>) prevents the replacement if > appears before < ahead in the string, as it would if inside a tag.

This is far from fool-proof, but it may be good enough.

BTW, as you are matching globally, i.e. making more than one replacement, id='searchHighlight' should probably be class='searchHighlight'.

And you need to be careful that search does not contain any regex special characters.
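A minimal sketch that combines the points above, assuming the search term should always be matched literally. The helper names escapeRegExp and highlight are made up for this example; the look-ahead is the same one as in the snippet above, and the same caveat about < or > inside attribute values applies.

```javascript
// Hypothetical helper: escape characters that have special meaning in a RegExp.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wrap every match of `search` that is not inside a tag.
function highlight(data, search) {
    if (!search) {
        return data; // an empty pattern would match at every position
    }
    // (?![^<>]*>) rejects a match when a > comes before the next < ahead,
    // which is the case when the match sits inside a tag.
    var pattern = new RegExp('(' + escapeRegExp(search) + ')(?![^<>]*>)', 'g');
    return data.replace(pattern, "<b class='searchHighlight'>$1</b>");
}
```

For example, highlight('<h1 id="title">the</h1>', 'h') leaves both h1 tags alone and returns <h1 id="title">t<b class='searchHighlight'>h</b>e</h1>.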

Upvotes: 4

collapsar

Reputation: 17258

You're probably aware that you're trying to use the wrong tool for the job, so this is just for the record (in case you're not, you may find this insightful).

You might (most certainly will?) run into one fundamental problem with HTML attributes that hold basically arbitrary text, namely title (the tooltip attribute) and data-... (generic user-defined attributes that hold arbitrary data by design). Whatever you find in the textual part of your HTML code you could find there too, and replacing it there will deface balloon help and/or wreck some application logic. Also note that any character of the textual content may be encoded as a named or numeric entity (e.g. & -> &amp;, &#38;, &#x26;), which can be handled in principle but will complicate the dynamic regex (vastly so if your variable search holds plain text).

Having said all this, you MIGHT get along with data.replace( new RegExp("([>]?)[^><]*("+search+")[^><]*([<]?)", 'g'), "<b id='searchHighlight'>$1$2$3</b>" ); unless the search string contains characters that have special meaning in a regex, like .+*|([{}])\ and perhaps -; these you'd have to escape properly.

In summary: revise your design to save yourself LOTS of trouble.

BTW, why wouldn't you opt for DOM traversal? You don't need to know which HTML tags are actually present to do that.
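To illustrate the attribute and entity concerns without touching the DOM, here is one possible string-only sketch. It is a different technique from the regex above: it splits the markup into tags, entities, and runs of plain text, and highlights only the text runs. The function name highlightText and the token pattern are assumptions for this example; the pattern is a simplification, not an HTML parser, and a search term that spans an entity (say, "R&D" stored as "R&amp;D") will not be found.

```javascript
// Hypothetical helper: highlight `search` only in plain-text runs of an HTML string.
function highlightText(html, search) {
    if (!search) {
        return html; // nothing to highlight
    }
    // Alternatives, in order: a tag, a named or numeric entity,
    // a run of plain text, or a stray < or & that matched nothing else.
    var token = /<[^>]*>|&(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);|[^<&]+|[<&]/g;
    return html.replace(token, function (piece) {
        var first = piece.charAt(0);
        if (first === '<' || first === '&') {
            return piece; // tag, entity, or stray delimiter: pass through unchanged
        }
        // split/join matches the search term literally, so no regex escaping is needed.
        return piece.split(search).join("<b class='searchHighlight'>" + search + "</b>");
    });
}
```

With this, searching for n in <p title="n">on&nbsp;an</p> highlights the n in "on" and "an" but leaves both the title attribute and &nbsp; intact.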

Upvotes: 1
