Eddie Deng

Reputation: 1479

Pattern.split slower than String.split

There are two methods:

private static void normalSplit(String base){
    base.split("\\.");
}

private static final Pattern p = Pattern.compile("\\.");

private static void patternSplit(String base){
    //use the static field above
    p.split(base);

}

And I test them like this in the main method:

public static void main(String[] args) throws Exception{
    long start = System.currentTimeMillis();
    String longstr = "a.b.c.d.e.f.g.h.i.j";//use any long string you like
    for(int i=0;i<300000;i++){
        normalSplit(longstr);//switch to patternSplit to see the difference
    }
    System.out.println((System.currentTimeMillis()-start)/1000.0);
}

My intuition was that String.split eventually calls Pattern.compile(regex).split(...) to do the real work, after some extra overhead. So if I construct the Pattern object in advance (it is thread-safe), splitting should be faster.

In fact, using the pre-constructed Pattern is much slower than calling String.split directly. I tried both with a 50-character string (in MyEclipse), and the direct call took only about half the time of the pre-constructed Pattern.

Can someone explain why this happens?

Upvotes: 9

Views: 1764

Answers (3)

nikis

Reputation: 11234

This is due to a change in String.split behaviour that was made in Java 7. This is what we have in 7u40:

public String[] split(String regex, int limit) {
    /* fastpath if the regex is a
     (1)one-char String and this character is not one of the
        RegEx's meta characters ".$|()[{^?*+\\", or
     (2)two-char String and the first char is the backslash and
        the second is not the ascii digit or ascii letter.
     */
    char ch = 0;
    if (((regex.value.length == 1 &&
         ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
         (regex.length() == 2 &&
          regex.charAt(0) == '\\' &&
          (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
          ((ch-'a')|('z'-ch)) < 0 &&
          ((ch-'A')|('Z'-ch)) < 0)) &&
        (ch < Character.MIN_HIGH_SURROGATE ||
         ch > Character.MAX_LOW_SURROGATE))
    {
        //do stuff
        return list.subList(0, resultSize).toArray(result);
    }
    return Pattern.compile(regex).split(this, limit);
}

And this is what we had in 6-b14:

public String[] split(String regex, int limit) {
    return Pattern.compile(regex).split(this, limit);
}
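The fast path only changes how the split is performed, not what it returns. A minimal sketch to check that, assuming a hypothetical class SplitEquivalence and helper sameResult (neither is from the JDK or the question):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitEquivalence {
    // Precompiled pattern, as in the question.
    private static final Pattern DOT = Pattern.compile("\\.");

    // Hypothetical helper: true if both approaches produce the same tokens.
    static boolean sameResult(String base) {
        String[] viaString = base.split("\\.");   // may take the fast path
        String[] viaPattern = DOT.split(base);    // always uses the regex engine
        return Arrays.equals(viaString, viaPattern);
    }

    public static void main(String[] args) {
        System.out.println(sameResult("a.b.c.d.e.f.g.h.i.j"));
        System.out.println(sameResult("a..b."));
    }
}
```

Both calls use a limit of 0, so trailing empty strings are dropped in either case.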

Upvotes: 2

tobias_k

Reputation: 82889

This may depend on the actual implementation of Java. I'm using OpenJDK 7, and here String.split does indeed invoke Pattern.compile(regex).split(this, limit), but only if regex does not qualify for the fast path: roughly, a single character that is not a regex metacharacter, or a backslash followed by a character that is not an ASCII letter or digit.

See here for the source code, line 2312.

public String[] split(String regex, int limit) {
   /* fastpath if the regex is a
      (1)one-char String and this character is not one of the
         RegEx's meta characters ".$|()[{^?*+\\", or
      (2)two-char String and the first char is the backslash and
         the second is not the ascii digit or ascii letter.
   */
   char ch = 0;
   if (((regex.count == 1 &&
       // a bunch of other checks and lots of low-level code
       return list.subList(0, resultSize).toArray(result);
   }
   return Pattern.compile(regex).split(this, limit);
}

As you are splitting by "\\.", it is using the "fast path". That is, if you are using OpenJDK.
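To see which delimiters would qualify, here is a rough sketch that restates the condition from the source comment as a standalone predicate. The class FastPathCheck and the method isFastPath are hypothetical names for illustration, not part of the JDK, and the check mirrors only the OpenJDK 7 code quoted in this thread; other versions may differ.

```java
public class FastPathCheck {
    // Hypothetical helper mirroring the fast-path condition quoted from OpenJDK 7.
    static boolean isFastPath(String regex) {
        char ch;
        if (regex.length() == 1) {
            // Case (1): one char that is not a regex metacharacter.
            ch = regex.charAt(0);
            if (".$|()[{^?*+\\".indexOf(ch) != -1) {
                return false;
            }
        } else if (regex.length() == 2 && regex.charAt(0) == '\\') {
            // Case (2): backslash followed by something other than an ASCII letter or digit.
            ch = regex.charAt(1);
            boolean asciiAlnum = (ch >= '0' && ch <= '9')
                    || (ch >= 'a' && ch <= 'z')
                    || (ch >= 'A' && ch <= 'Z');
            if (asciiAlnum) {
                return false;
            }
        } else {
            return false;
        }
        // Surrogate characters are excluded from the fast path.
        return ch < Character.MIN_HIGH_SURROGATE || ch > Character.MAX_LOW_SURROGATE;
    }

    public static void main(String[] args) {
        System.out.println(isFastPath("\\.")); // true: escaped dot
        System.out.println(isFastPath("."));   // false: bare metacharacter
        System.out.println(isFastPath(","));   // true: plain single char
        System.out.println(isFastPath("\\d")); // false: character class
        System.out.println(isFastPath("ab"));  // false: two literal chars
    }
}
```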

Upvotes: 5

Evgeniy Dorofeev

Reputation: 135992

I think this can only be explained by JIT optimization. String.split is internally implemented as follows:

Pattern.compile(regex).split(this, limit);

and it works faster when it is inside String.class, but when I use the same code in the test:

    for (int i = 0; i < 300000; i++) {
        //base.split("\\.");// switch to patternSplit to see the difference
        //p.split(base);
        Pattern.compile("\\.").split(base, 0);
    }

I get the same result as with p.split(base).
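Since JIT effects are in question, a comparison is easier to trust if each variant is warmed up before it is timed. The following is only a sketch under that assumption: the class name, the helper methods, and the iteration counts are all made up for illustration, and a proper harness such as JMH would be more reliable. No timings are shown because they depend on the JVM and machine.

```java
import java.util.regex.Pattern;

public class SplitBenchmarkSketch {
    private static final Pattern P = Pattern.compile("\\.");

    // Accumulates result lengths so the JIT cannot discard the split calls.
    static int sink;

    // Times String.split with the regex given as a string literal.
    static long timeStringSplit(String base, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += base.split("\\.").length;
        }
        return System.nanoTime() - start;
    }

    // Times the precompiled Pattern from the question.
    static long timePatternSplit(String base, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += P.split(base).length;
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        String base = "a.b.c.d.e.f.g.h.i.j";
        int iterations = 300000;

        // Warm-up runs: results are discarded so compilation happens before measuring.
        timeStringSplit(base, iterations);
        timePatternSplit(base, iterations);

        System.out.println("String.split:  " + timeStringSplit(base, iterations) / 1e6 + " ms");
        System.out.println("Pattern.split: " + timePatternSplit(base, iterations) / 1e6 + " ms");
    }
}
```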

Upvotes: 0
