Reputation: 96581
Consider 3 regex expressions designed to remove non Latin characters from the string.
String x = "some†¥¥¶¶ˆ˚˚word";
long now = System.nanoTime();
System.out.println(x.replaceAll("[^a-zA-Z]", "")); // 5ms
System.out.println(System.nanoTime() - now);
now = System.nanoTime();
System.out.println(x.replaceAll("[^a-zA-Z]+", "")); // 2ms
System.out.println(System.nanoTime() - now);
now = System.nanoTime();
System.out.println(x.replaceAll("[^a-zA-Z]*", "")); // <1ms
System.out.println(System.nanoTime() - now);
All 3 produce the same result with vastly difference performance metrics.
Why is that?
Upvotes: 4
Views: 69
Reputation: 4438
The last one will replace empty strings with empty strings (unless that is optimized away, I do not know the compiler) which seems a bit unnecessary... ;-)
The first one will search much more times than the second if non-latin chars are adjecent. Otherwise not. So I guess the time for 1 and 2 might be roughly the same on some texts and longer for 1 on other texts.
Upvotes: 1
Reputation: 32596
The first one is slower because the regex matches each non-latin character individually, so replaceAll
operates on each characters individually.
The other patterns match the whole sequence of non-latin characters, so replaceAll
can replace the whole sequence in one go. I can't explain the performance difference between these two, though. Probably something to do with the difference in handling *
and +
in the regex engine.
Upvotes: 1