RobKohr

Reputation: 6943

Using regex to replace characters between two strings while ignoring html tags and new line breaks

I need to redact health information from emails that are loaded into a string variable by replacing characters with █. In these emails, the content between the phrases "health issues?" and "Have you worked" must be replaced, but anything that appears inside tags must be left alone. In addition, lines are often wrapped with = signs, and those line breaks, spaces, and = signs can occur right in the middle of a tag, as well as in the middle of the phrases used to identify the start and end.

Example:

(More content)
.....have any health issues? We currently do not have any health issues</sp=
an></li>
 <li id=3D"m_-622133557606915713yui_3_16_0_ym19_1_1515713539439_17326" styl=
e=3D"margin-top:0;margin-bottom:0;vertical-align:middle;line-height:15pt;co=
lor:black"><span id=3D"m_-622133557606915713yui_3_16_0_ym19_1_1515713539439=
_17327" style=3D"font-family:Arial;font-size:11.0pt">Some more text. 
Have
     you worked.....(more content)

I figure there is a way to do this in JavaScript using one or more regular expressions, but I am at a loss to see how.

The desired result would look like:

(More content)
.....have any health issues?███████████████████████████████████████████</sp=
an></li>
 <li id=3D"m_-622133557606915713yui_3_16_0_ym19_1_1515713539439_17326" styl=
e=3D"margin-top:0;margin-bottom:0;vertical-align:middle;line-height:15pt;co=
lor:black"><span id=3D"m_-622133557606915713yui_3_16_0_ym19_1_1515713539439=
_17327" style=3D"font-family:Arial;font-size:11.0pt">███████████████
Have
     you worked.....(more content)

Upvotes: 0

Views: 227

Answers (1)

revo

Reputation: 48741

You could use two replace calls to solve this problem. The first one matches everything from health issues? to Have you worked, captured into three capturing groups. We are interested in the second capturing group:

(health issues\?)([\s\S]*?)(Have\s+you\s+worked)
                  ^^^^^^^^
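
To see what each capturing group holds, here is a minimal sketch that runs the regex above against a short made-up string. The `sample` text and the variable names are assumptions for illustration only; the pattern itself is unchanged.

```javascript
// Hypothetical sample text; only the two marker phrases come from the question.
const sample = 'Any health issues? <b>None</b>\nHave\n   you worked here?';

// Same pattern as above: start marker, lazy "anything" in between, end marker.
const outer = /(health issues\?)([\s\S]*?)(Have\s+you\s+worked)/;

const m = sample.match(outer);
const startMarker = m[1]; // the literal start phrase
const between = m[2];     // everything between the markers, tags included
const endMarker = m[3];   // the end phrase, with whatever whitespace separated its words

// JSON.stringify makes the embedded newlines visible.
console.log(JSON.stringify(startMarker));
console.log(JSON.stringify(between));
console.log(JSON.stringify(endMarker));
```

Because `[\s\S]*?` is lazy, the second group stops at the first place where the end phrase can match, and `\s+` lets that phrase span a line break.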

We run the second replace on this captured group, substituting each character outside of tags with █. This is the regex:

(<\/?\w[^<>]*>)|[\s\S]

We need to keep the first capturing group (these matches are probably HTML tags) and replace whatever the other side of the alternation ([\s\S]) matches with █.
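
As a small illustration of that alternation, the sketch below wraps the tag-or-character regex in a helper. The name `redactOutsideTags` and the input fragment are made up for this example.

```javascript
// Either a tag-like run (captured in group 1) or any single character.
const tagOrChar = /(<\/?\w[^<>]*>)|[\s\S]/g;

// Hypothetical helper: keep tag-like matches, mask everything else.
function redactOutsideTags(fragment) {
  return fragment.replace(tagOrChar, (match, tag) => (tag ? tag : '█'));
}

// The closing tag is split by a soft line break ("=" plus newline), as in the
// question. [^<>]* also matches "=" and newlines, so it is still kept whole.
const redacted = redactOutsideTags('ab<span id=3D"x">cd</sp=\nan>e');
console.log(redacted);
```

Note that the second alternative matches every character, including spaces and line breaks, so any whitespace outside tags is masked as well.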

Disclaimer: this is not bulletproof, as regex shouldn't be used to parse HTML.

Demo:

var str = `(More content)
.....have any health issues? We currently do not have any health issues</sp=
an></li>
 <li id=3D"m_-622133557606915713yui_3_16_0_ym19_1_1515713539439_17326" styl=
e=3D"margin-top:0;margin-bottom:0;vertical-align:middle;line-height:15pt;co=
lor:black"><span id=3D"m_-622133557606915713yui_3_16_0_ym19_1_1515713539439=
_17327" style=3D"font-family:Arial;font-size:11.0pt">Some more text. 
Have
     you worked.....(more content)`;

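// Outer replace: isolate the text between the two marker phrases.
console.log(str.replace(/(health issues\?)([\s\S]*?)(Have\s+you\s+worked)/, function(match, start, body, end) {
    // Inner replace: keep anything that looks like a tag, mask every other character.
    return start + body.replace(/(<\/?\w[^<>]*>)|[\s\S]/g, function(match, tag) {
        return tag ? tag : '█';
    }) + end;
}));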
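
The demo masks line breaks too, and it only tolerates whitespace between the words of the end phrase. The question also mentions = line wraps falling inside the marker phrases and shows line breaks surviving in the desired output. A possible variation, offered as a sketch rather than a tested solution, is below. The helpers `markerPattern` and `redactBetween` are hypothetical names, and the code assumes a soft wrap is always an = immediately followed by a line break.

```javascript
// Build a regex source for a marker phrase that tolerates soft line breaks
// ("=" + newline) between any two characters and any whitespace run where
// the phrase has a space. Assumption: this is how the wraps appear in the input.
function markerPattern(phrase) {
  const soft = '(?:=\\r?\\n)*';
  return phrase
    .split('')
    .map(ch => (ch === ' ' ? '\\s+' : ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')))
    .join(soft);
}

function redactBetween(text, startPhrase, endPhrase) {
  const outer = new RegExp(
    '(' + markerPattern(startPhrase) + ')([\\s\\S]*?)(' + markerPattern(endPhrase) + ')'
  );
  // Keep tag-like runs, soft line breaks, and bare newlines; mask the rest.
  const inner = /(<\/?\w[^<>]*>|=\r?\n|\r?\n)|[\s\S]/g;
  return text.replace(outer, (match, start, body, end) =>
    start + body.replace(inner, (m, keep) => (keep ? keep : '█')) + end
  );
}

// Made-up input with soft breaks inside both marker phrases.
const wrapped = 'health iss=\nues? secret<b>x</b>\nHa=\nve you worked';
const result = redactBetween(wrapped, 'health issues?', 'Have you worked');
console.log(result);
```

Like the original, this treats anything shaped like `<...>` as a tag and does not decode other quoted-printable escapes such as =3D, so the same disclaimer about parsing HTML with regex applies.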

Upvotes: 1
