millebii
millebii

Reputation: 1287

Does Java regex optimize this specific case?

I wonder how does regex work, my particular regex has an element that looks like this:

(word1|word2|wordn......)

The numbers of words is big several hundreds.
I wonder if the regex engine is just testing the words one by one or if it optimizes the search and it what way.
Any pointer to good documentation will be good.

Upvotes: 5

Views: 318

Answers (2)

aioobe
aioobe

Reputation: 421340

If it seems to you that the RE engine is the bottleneck in such search, you could easily construct a trie and check for containment.

Upvotes: 1

Yoni
Yoni

Reputation: 10321

If you have several hundred words, you need to beware of the ordering of the words in the regex. The regex engine looks for the words from left to right.
If you test the word setValue against the alternation set|setValue, it will match only the 3 letters comprising "set", and not the whole string.

See this link (from www.regular-expressions.info) for the full explanation.

I don't think that the regex engine truly optimizes alternations (i.e., analyzing common prefixes and building nfa accordingly). Therefore, with so many words, I don't think it will be an optimization.

Aside from re-ordering the words, you can also try adding word or line boundary after the alternation, e.g. (set|setValue)$, but I suspect that the regex engine will do a lot of backtracking so it may not be worth the effort.

Upvotes: 1

Related Questions