
Performance issues in Regular Expression

My REST service runs under heavy load: it gets a lot of traffic, around a million read calls per day. The service looks up a record in the database based on the userID and retrieves a handful of columns for that userID.

I am currently seeing serious performance problems in my code, and I suspect the method below is one of the first I should optimize.

The method accepts an attribute name and returns the part of it that matches a regular expression.

For example, if attrName is technology.profile.financial,

then the method returns technology.profile. It works the same way for the other cases.

private String getAttrDomain(String attrName){
    Pattern r = Pattern.compile(CommonConstants.VALID_DOMAIN);
    Matcher m = r.matcher(attrName.toLowerCase());
    if (m.find()) {
      return m.group(0);
    }
    return null;
}

In the CommonConstants class file:

String  VALID_DOMAIN = "(technology|computer|sdc|adj|wdc|pp|stub).(profile|preference|experience|behavioral)";

I am trying to find out whether the regex above could cause performance problems. If so, what is the best way to rewrite it with performance in mind?

Upvotes: 2

Answers (3)

Eugene

Reputation: 121088

I used Caliper to benchmark this, and the result is clear: compiling the Pattern once, up front, rather than on every method call, is the fastest approach.

Your regex method is the fastest way to do it, BUT the one change you need to make is to compile the Pattern up front, not every time:

 private static final Pattern pattern = Pattern.compile(VALID_DOMAIN);

then in your method:

 Matcher matcher = pattern.matcher(input); ...
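
Put together, the change might look like the sketch below. The class name `AttrDomainLookup` is hypothetical, the pattern is inlined with the dot escaped instead of being read from `CommonConstants`, and `Locale.ROOT` is an addition of this sketch to keep lowercasing locale-independent.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical holder class; in the question the method lives in the REST service.
final class AttrDomainLookup {
    // Compiled once when the class is initialized. Pattern is immutable and thread-safe.
    private static final Pattern VALID_DOMAIN_PATTERN = Pattern.compile(
        "(technology|computer|sdc|adj|wdc|pp|stub)\\.(profile|preference|experience|behavioral)");

    static String getAttrDomain(String attrName) {
        // A Matcher is not thread-safe, so create a fresh one per call.
        Matcher m = VALID_DOMAIN_PATTERN.matcher(attrName.toLowerCase(Locale.ROOT));
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(getAttrDomain("technology.profile.financial")); // technology.profile
        System.out.println(getAttrDomain("unknown.attribute"));            // null
    }
}
```

Only the Pattern is shared between calls; the Matcher stays local to the method, which keeps the method safe to call from concurrent request threads.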

For those interested, these are the settings I used for Caliper: --warmupMillis 10000 --runMillis 100

 package stackoverflow;

 import java.util.regex.Matcher;
 import java.util.regex.Pattern;

 import com.google.caliper.Param;
 import com.google.caliper.Runner;
 import com.google.caliper.SimpleBenchmark;
 import com.google.common.base.Splitter;
 import com.google.common.collect.Iterables;

 public class RegexPerformance extends SimpleBenchmark {
      private static final String firstPart    = "technology|computer|sdc|adj|wdc|pp|stub";
      private static final String secondPart   = "profile|preference|experience|behavioral";
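      private static final String VALID_DOMAIN = "(technology|computer|sdc|adj|wdc|pp|stub)\\.(profile|preference|experience|behavioral)";
      // Compiled once and shared; used by regexMatchOutsideMethod.
      private static final Pattern pattern = Pattern.compile(VALID_DOMAIN);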

      @Param({"technology.profile.financial", "computer.preference.test","sdc.experience.test"})
      private String input;

      public static void main(String[] args) {
           Runner.main(RegexPerformance.class, args);
      }

      public void timeRegexMatch(int reps){
          for(int i=0;i<reps;++i){
              regexMatch(input);
          }
      }


      public void timeGuavaMatch(int reps){
          for(int i=0;i<reps;++i){
              guavaMatch(input);
          }
      }

      public void timeRegexMatchOutsideMethod(int reps){
          for(int i=0;i<reps;++i){
              regexMatchOutsideMethod(input);
          }
      }


    public String regexMatch(String input){
        Pattern p = Pattern.compile(VALID_DOMAIN);
        Matcher m = p.matcher(input);
        if(m.find()) return m.group();
        return null;
    }

    public String regexMatchOutsideMethod(String input){
          Matcher matcher = pattern.matcher(input);
          if(matcher.find()) return matcher.group();
          return null;
    }

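    public String guavaMatch(String input){
        Iterable<String> tokens = Splitter.on(".").omitEmptyStrings().split(input);
        // Iterables.get throws IndexOutOfBoundsException if the input has fewer than two tokens.
        String firstToken  = Iterables.get(tokens, 0);
        String secondToken = Iterables.get(tokens, 1);
        // Note: contains() is a substring test against the "a|b|c" strings, so a partial
        // token such as "tech" would also pass. Good enough for this benchmark only.
        if( (firstPart.contains(firstToken) ) && (secondPart.contains(secondToken)) ){
            return firstToken+"."+secondToken;
        }
        return null;
    }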
}

And the results of the test:

             RegexMatch technology.profile.financial 2980 ========================
             RegexMatch     computer.preference.test 2861 =======================
             RegexMatch          sdc.experience.test 3683 ==============================
RegexMatchOutsideMethod technology.profile.financial  179 =
RegexMatchOutsideMethod     computer.preference.test  227 =
RegexMatchOutsideMethod          sdc.experience.test  987 ========
             GuavaMatch technology.profile.financial  406 ===
             GuavaMatch     computer.preference.test  421 ===
             GuavaMatch          sdc.experience.test  382 ===

Upvotes: 3

Alan Moore

Reputation: 75272

Is there any reason why you can't save the regex as a Pattern rather than as a string? If the regex never changes, you're wasting a lot of time recompiling the regex every time you use it. For such a simple pattern, compiling the regex probably takes a lot more time than actually matching it.

As for the regex itself, there are some changes I would recommend. These changes will make the regex slightly more efficient, but it probably won't be enough to notice. The purpose is to make it more robust.

Enclose it in word boundaries to avoid false positives on strings like foo_technology.profile or technology.profile_bar. I'm sure you know that kind of thing won't happen in your case, but why take even the smallest risk when it's so easy to avoid?
Escape the dot, as @plaix suggested.
Use non-capturing groups instead of capturing. (Assuming you don't really need to break out the individual components of the attribute name.)

static final Pattern VALID_DOMAIN_PATTERN = Pattern.compile(
    "\\b(?:technology|computer|sdc|adj|wdc|pp|stub)\\.(?:profile|preference|experience|behavioral)\\b");

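To see what the escaped dot and the word boundaries change in practice, here is a small sketch comparing the pattern from the question with the one above. The class name `DomainPatternChecks`, the helper `found`, and the input `technology_profile` are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two patterns on a few inputs.
final class DomainPatternChecks {
    // Pattern from the question: the dot is unescaped, so it matches any character.
    static final Pattern LOOSE = Pattern.compile(
        "(technology|computer|sdc|adj|wdc|pp|stub).(profile|preference|experience|behavioral)");

    // Pattern from this answer: escaped dot, word boundaries, non-capturing groups.
    static final Pattern STRICT = Pattern.compile(
        "\\b(?:technology|computer|sdc|adj|wdc|pp|stub)\\.(?:profile|preference|experience|behavioral)\\b");

    static boolean found(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "technology.profile.financial", // valid attribute name
            "technology_profile",           // wrong separator
            "foo_technology.profile",       // extra word characters in front
            "technology.profile_bar"        // extra word characters behind
        };
        for (String s : inputs) {
            System.out.println(s + " loose=" + found(LOOSE, s) + " strict=" + found(STRICT, s));
        }
    }
}
```

The loose pattern finds a match in all four inputs, while the strict one accepts only the first.
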
Upvotes: 2

MikeM

Reputation: 13641

Two small points:

As well as compiling the expression outside of the function, as mentioned in the comments, you could make the () non-capturing so the content matched by each is not saved, i.e.

String  VALID_DOMAIN = "(?:technology|computer|sdc|adj|wdc|pp|stub)\\.(?:profile|preference|experience|behavioral)";

and if the valid domain must always appear at the beginning of the attribute name, you could perhaps use the lookingAt method instead of find, so that the match can fail quicker, i.e.

 if (m.lookingAt()) {

And if the expression was compiled outside of the function, you could add Pattern.CASE_INSENSITIVE so then you wouldn't have to call toLowerCase() on the attrName each time.
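
A minimal sketch combining these suggestions, assuming the domain always starts the attribute name. The class name `AnchoredDomainLookup` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical holder class for the precompiled, case-insensitive pattern.
final class AnchoredDomainLookup {
    private static final Pattern VALID_DOMAIN_PATTERN = Pattern.compile(
        "(?:technology|computer|sdc|adj|wdc|pp|stub)\\.(?:profile|preference|experience|behavioral)",
        Pattern.CASE_INSENSITIVE);

    static String getAttrDomain(String attrName) {
        Matcher m = VALID_DOMAIN_PATTERN.matcher(attrName);
        // lookingAt() attempts a match only at index 0; it never scans forward.
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(getAttrDomain("Technology.Profile.financial")); // Technology.Profile
        System.out.println(getAttrDomain("x.technology.profile"));         // null
    }
}
```

One difference from the original method: without toLowerCase(), the returned text keeps the casing of the input. If callers rely on a lowercase result, lowercasing only the matched group is still cheaper than lowercasing the whole attribute name.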

Upvotes: 2
