AjmeraInfo

Reputation: 504

Regex: match all email addresses except those inside an anchor tag

I need to find all email addresses in content, with or without HTML, and replace each one with a link.

I have the regex below for finding email addresses, and it works perfectly.

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Here is a demo link with sample data to work on: https://regexr.com/3v12e

Here is the anchor tag regex: (?:(<a[^>]*>([^<]+)<\/a>))

So how can I find all email addresses except those inside an anchor tag?


Upvotes: 2

Views: 319

Answers (1)

Julio

Reputation: 5308

You may use something similar to the trash bin trick.

You basically search for three cases: an 'a' tag, an email, and 'the rest'. You assign a capturing group to each of those three cases. Then, depending on which group actually matched, you can do different things. This gives the structure (A_TAG)|(EMAIL)|([\s\S]), where [\s\S] means any character, including newlines.

Note that the order matters. You want the first group to be the 'a' tag so that anchors are consumed and set aside quickly. The 'any character' alternative ([\s\S]) must come last: if it came first, it would match every character and the other alternatives would never get a chance to match.

const regex = /(?:(<a[^>]*>(?:[^<]+)<\/a>))|((?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\]))|([\s\S])/gm;
const str = `[email protected]
[email protected]
For more information [email protected] about I Can Read, 
[email protected] please refer to <a href="mailto:[email protected]">[email protected]</a>our website [email protected]
[email protected]
 [email protected]
    [email protected]
	 [email protected]
	 
	 sdfsdf [email protected]
[email protected] sdfsdfsdf`;
let m;

let acc = '';
while ((m = regex.exec(str)) !== null) {
    if (m[1] !== undefined) {
        // Group 1 matched: this is a complete <a> element,
        // so append it to the accumulator unchanged.
        acc += m[1];
    }
    else if (m[2] !== undefined) {
        // Group 2 matched: this is a bare email address,
        // so wrap it in a mailto link.
        acc += '<a href="mailto:' + m[2] + '">' + m[2] + '</a>';
    }
    else {
        // Any other single character: append it unchanged.
        acc += m[3];
    }
}
console.log(acc);

Also, here you can find a demo, just to see visually the different capturing groups. Of course, for the replacements, you would need the extra logic described above.
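
As a more compact variant of the same idea, here is a minimal sketch that uses String.prototype.replace with a callback instead of the exec loop. Because replace leaves unmatched text untouched, the third 'any character' group is not needed. Note the assumptions: the EMAIL pattern is a deliberately simplified stand-in for the full pattern from the question, the anchor pattern is loosened to allow nested markup inside the link, and linkifyEmails is a hypothetical helper name.

```javascript
// Simplified email pattern, for illustration only (an assumption);
// substitute the full pattern from the question if you need it.
const EMAIL = "[a-z0-9._%+-]+@[a-z0-9-]+(?:\\.[a-z0-9-]+)+";

// Group 1: a whole anchor element. Group 2: a bare email address.
const pattern = new RegExp(
  "(<a\\b[^>]*>[\\s\\S]*?<\\/a>)|(" + EMAIL + ")",
  "gi"
);

// Hypothetical helper: wraps bare emails in mailto links and
// leaves existing anchors exactly as they are.
function linkifyEmails(html) {
  return html.replace(pattern, (match, anchor, email) =>
    anchor !== undefined
      ? anchor
      : '<a href="mailto:' + email + '">' + email + "</a>"
  );
}

console.log(
  linkifyEmails(
    'Mail bob@example.com or <a href="mailto:amy@example.com">amy@example.com</a>.'
  )
);
```

Only the bare address (bob@example.com in this sample) gets wrapped; the existing anchor passes through unchanged. The "i" flag is an addition here so that upper-case addresses also match.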

Upvotes: 1
