Is there a way to make this regex more efficient?

Question

I'm trying to write a bounce (NDR) handler.

Basically this regex pulls an email address out of a piece of text.

[a-z0-9-._%+=!$~]+@[a-z0-9.-]+\.[a-z]{2,15}

Some NDRs contain a lot of text, and according to regex101.com the match requires a ton of steps, depending on where the email is located in the block of text. Is there a better regex that would be more efficient at finding the email in the NDR?

bezmax · Accepted Answer

Backtracking is usually what eats your CPU. The easiest way to deal with it is to write a regex that identifies not only the tokens you are looking for, but also the invalid tokens around them. That way you can eliminate most of the backtracking and make the whole match much faster.

However, what such a token looks like depends on the input text. Basically, you need to find a regexp that:

Covers as much as possible of "invalid" text
Backtracks as rarely as possible

If your NDR is plain text (as it usually is) then probably something like this should be optimal enough:

(?:[^@]+\s)

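Basically it's a non-capturing group that matches a run of anything that's not a "@", followed by a whitespace character. This means it will pretty much "eat" all your text up to the email address in one greedy pass, so the regex engine only has to back up to the last whitespace before the "@" instead of retrying the address pattern at every position in that text.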
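
To see what that prefix group consumes on its own, here is a small Python sketch. The function name split_at_prefix and the sample text are made up for illustration; only the pattern comes from the answer.

```python
import re

# The prefix group from the answer, on its own.
PREFIX = re.compile(r"(?:[^@]+\s)")


def split_at_prefix(text):
    """Return (consumed, remainder) for the prefix group, or None if it fails.

    Hypothetical helper, only meant to show where the prefix stops.
    """
    m = PREFIX.match(text)
    if m is None:
        return None
    return text[:m.end()], text[m.end():]


consumed, remainder = split_at_prefix("Delivery failed for bob@example.com")
print(repr(consumed))   # 'Delivery failed for '
print(repr(remainder))  # 'bob@example.com'
```

The greedy [^@]+ runs up to the "@" and then gives back characters until \s can match, which leaves the address pattern to start right at the local part.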

Full regex would look like this:

(?:[^@]+\s)([a-z0-9-._%+=!$~]+@[a-z0-9.-]+\.[a-z]{2,15})

Compare the two versions in the regex debugger at regex101.com to see the difference in how they work.

Edit: A more appropriate solution would be to check for [^a-z0-9-._%+=!$~] instead of \s. So the final version would be:

(?:[^@]+[^a-z0-9-._%+=!$~])([a-z0-9-._%+=!$~]+@[a-z0-9.-]+\.[a-z]{2,15})
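
As a rough sketch of how the final version could be used from Python: the names PATTERN and extract_address are hypothetical, and re.IGNORECASE is an assumption on my part, since the pattern only lists a-z. I have also moved the hyphen to the end of the character class, which matches the same characters but is harder to misread as a range.

```python
import re

# Characters allowed in the local part, hyphen last so it is a literal.
LOCAL = r"a-z0-9._%+=!$~-"

# Final regex from the answer; group 1 is the address itself.
# IGNORECASE is an assumption, not part of the original pattern.
PATTERN = re.compile(
    r"(?:[^@]+[^" + LOCAL + r"])"
    r"([" + LOCAL + r"]+@[a-z0-9.-]+\.[a-z]{2,15})",
    re.IGNORECASE,
)


def extract_address(text):
    """Return the first address found after some leading text, or None."""
    m = PATTERN.search(text)
    return m.group(1) if m else None


print(extract_address("Delivery to the following recipient failed: bob@example.com"))
```

One thing to watch: the prefix group needs at least two characters in front of the address, so an address at the very start of the input is not matched by this pattern. For an NDR body that is rarely a problem, but it is worth a test case.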

Sidenote: If you are considering optimizing such a simple regex query, there is a good chance that regex is not the right tool for you. Throw in some simple custom code and it will be much faster, and easier to understand and debug, than some monstrous regex. And the more you try to optimize the regex, the bigger a monstrosity it will become.
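
For what "simple custom code" might look like, here is a minimal Python sketch: find each "@" and walk outwards over the allowed characters. Everything here (find_address, the set names, accepting upper case, trimming trailing dots and hyphens) is my own assumption, and it only approximates the regex rather than reproducing it exactly.

```python
import string

# Same character sets as the regex, as plain sets. Accepting both cases is an
# assumption; the original pattern lists only a-z.
LOCAL_CHARS = set(string.ascii_letters + string.digits + "-._%+=!$~")
DOMAIN_CHARS = set(string.ascii_letters + string.digits + ".-")
LETTERS = set(string.ascii_letters)


def find_address(text):
    """Return the first plausible address in text, or None (hypothetical helper)."""
    at = text.find("@")
    while at != -1:
        # Walk left over local-part characters.
        start = at
        while start > 0 and text[start - 1] in LOCAL_CHARS:
            start -= 1
        # Walk right over domain characters.
        end = at + 1
        while end < len(text) and text[end] in DOMAIN_CHARS:
            end += 1
        # Drop trailing dots/hyphens, e.g. a sentence-ending period.
        domain = text[at + 1:end].rstrip(".-")
        name, dot, tld = domain.rpartition(".")
        # Require a local part, something before the last dot,
        # and a final label of 2 to 15 ASCII letters.
        if (start < at and name and dot
                and 2 <= len(tld) <= 15 and set(tld) <= LETTERS):
            return text[start:at] + "@" + domain
        at = text.find("@", at + 1)
    return None


print(find_address("Delivery failed for bob@example.com."))
```

Text without an "@" is handled by a single str.find call, and each candidate only touches the characters right around it.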
