DareToDead

Reputation: 47

Regex taking a lot of time when tested with large data

I have created a Java regex:

(?=\b((?<![^\p{Alnum}\p{Punct}])(?:\p{Alnum}+\p{Punct}\p{Alnum}+){2})\b)

I tested this regex against a sample string: https://www.google.com.google.com

It gives me all the expected tokens:

www.google.com google.com.google com.google.com

However, this regex takes a long time when it is run against a large string.

My expected tokens are in the form of "alphanumeric punctuation alphanumeric".

How can I optimize this regex?

Upvotes: 2

Views: 61

Answers (1)

Wiktor Stribiżew

Reputation: 626690

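You need to simplify the regex. In your pattern, repeating (?:\p{Alnum}+\p{Punct}\p{Alnum}+) twice puts two \p{Alnum}+ quantifiers side by side, so the engine can split the middle alphanumeric run in many different ways while backtracking. Use this instead: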

(?=\b((?:\p{Alnum}+\p{Punct}){2}\p{Alnum}+)\b)

See the regex demo.

Details:

  • (?=...) - a positive lookahead; it consumes no text, so overlapping matches can be found
  • \b - a word boundary
  • ((?:\p{Alnum}+\p{Punct}){2}\p{Alnum}+) - Group 1:
    • (?:\p{Alnum}+\p{Punct}){2} - two occurrences of one or more letters/digits followed by a punctuation char, and then
    • \p{Alnum}+ - one or more letters/digits
  • \b - a word boundary

Note that no two adjacent subpatterns can match the same text at the same location in the string, which keeps backtracking to a minimum and makes the pattern about as efficient as it can be. Still, extracting overlapping matches is never very fast, since the lookahead must be attempted at every position in the string, from left to right.
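For illustration, here is a minimal sketch of how the simplified pattern might be used from Java to collect the overlapping tokens. The class name OverlappingTokens, the method name extract and the constant TOKEN are made up for this example and do not come from the question. Since the whole pattern is a lookahead, every match is empty and the token itself is read from Group 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
class OverlappingTokens {

    // Compile once and reuse; recompiling for every input wastes time on large data.
    private static final Pattern TOKEN = Pattern.compile(
            "(?=\\b((?:\\p{Alnum}+\\p{Punct}){2}\\p{Alnum}+)\\b)");

    // Returns every "alnum punct alnum punct alnum" token, including overlapping ones.
    static List<String> extract(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        // Each match is zero-width, so find() moves on to the next position by itself.
        while (m.find()) {
            tokens.add(m.group(1)); // the captured token lives in Group 1
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample string taken from the question.
        System.out.println(extract("https://www.google.com.google.com"));
    }
}
```

With the sample string from the question, this should print [www.google.com, google.com.google, com.google.com]. Keeping the compiled Pattern in a static field is a design choice for this sketch, assuming the same pattern is applied to many inputs.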

Upvotes: 2
