RonPringadi
RonPringadi

Reputation: 1504

Speed optimization java string contains versus regex match

How do I deliver the best performance(speed) to check if a sentence contains any keyword1, keyword2, keyword, etc.

Here are my options:

  1. Use String.contains: if(string.contains(item1)||string.contains(item2)||string.contains(item3))
  2. Or build a for loop for option #1 before the if-or-or-or above becomes out of control.
  3. Use regex
  4. Another option is using Java 8 Streaming API Which is not available for me at the moment. The client uses Java 7

Upvotes: 0

Views: 1457

Answers (2)

Michael Bar-Sinai
Michael Bar-Sinai

Reputation: 2739

First off, every answer should be subject to testing in production conditions. When performance is the question, RAM and cache sizes, bus speeds etc. come into play and make things hard to predict. Another question is how many times will this code run - the JVM will initially run an interpreted version of it, and only after the code has been executed enough times will replace that with a compiled (and faster) version.

Having said that, here are some pointers:

  • If you have a lot of keywords, consider parallelizing the task. Use an executor, or a parallel stream. That would only work for about 100+ keywords, and make your code slower for smaller amounts of keywords.
  • If the keywords are used often enough, try using some algorithm to search all of them, e.g. using a prefix-tree (a.k.a. trie). Note that these structures can cause inefficient memory use, as the node objects may be scattered in memory and thus cause cache misses while being traversed. This is why an ArrayList is faster than a LinkedList in practice, even though they have the similar properties in theory.
  • Try switching to byte arrays (i.e. using String.getBytes), and then using the methods of Arrays class to find each word. This has the advantage of memory locality. Note that Unicode may be tricky here, so you may need to normalize it first.

But above all, test. Just make sure you're doing your micro-benchmarks properly.

Upvotes: 2

Stéphane GRILLON
Stéphane GRILLON

Reputation: 11884

I advise you to use regexp because it is very simple and powerful

import java.util.regex.Matcher;
import java.util.regex.Pattern;

final String regex = "STRING1|STRING2|STRING3";
final String string = "xxxSTRING1xxxSTRING2xxx";

final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);

while (matcher.find()) {
    System.out.println("Full match: " + matcher.group(0));
    for (int i = 1; i <= matcher.groupCount(); i++) {
        System.out.println("Group " + i + ": " + matcher.group(i));
    }
}

STDOUT:

Full match: STRING1
Full match: STRING2

DEMO in online IDE: here

Upvotes: 0

Related Questions