James Raitsev
James Raitsev

Reputation: 96581

Comparing performance of different regexes, clarification needed

Consider 3 regex expressions designed to remove non Latin characters from the string.

    String x = "some†¥¥¶¶ˆ˚˚word";

    long now = System.nanoTime();
    System.out.println(x.replaceAll("[^a-zA-Z]", ""));     // 5ms
    System.out.println(System.nanoTime() - now);

    now = System.nanoTime();
    System.out.println(x.replaceAll("[^a-zA-Z]+", ""));    // 2ms
    System.out.println(System.nanoTime() - now);

    now = System.nanoTime();
    System.out.println(x.replaceAll("[^a-zA-Z]*", ""));    // <1ms
    System.out.println(System.nanoTime() - now);

All 3 produce the same result with vastly difference performance metrics.

Why is that?

Upvotes: 4

Views: 69

Answers (2)

Leo
Leo

Reputation: 4438

The last one will replace empty strings with empty strings (unless that is optimized away, I do not know the compiler) which seems a bit unnecessary... ;-)

The first one will search much more times than the second if non-latin chars are adjecent. Otherwise not. So I guess the time for 1 and 2 might be roughly the same on some texts and longer for 1 on other texts.

Upvotes: 1

Andrew Cooper
Andrew Cooper

Reputation: 32596

The first one is slower because the regex matches each non-latin character individually, so replaceAll operates on each characters individually.

The other patterns match the whole sequence of non-latin characters, so replaceAll can replace the whole sequence in one go. I can't explain the performance difference between these two, though. Probably something to do with the difference in handling * and + in the regex engine.

Upvotes: 1

Related Questions