Reputation: 3497
I want to understand the performance issues that can emerge while making substring search in Java. I know the two built-in methods of searching for substring in Java.
1. String.indexOf()
As far as I understand this method uses the brute-force algorithm of substring search, thus its complexity is O(nm) where n and m are lengths of string and pattern.
2. Use Pattern and Matcher
I know nothing about how the regex algorithms are implemented and about their complexity.
So the questions are:
1) Which of these methods is preferrable from the perspective of performance?
2) What is the complexity of regex search? Does it depends on the regex itself?
Upvotes: 1
Views: 1514
Reputation: 14225
Honestly, if you care about worst-case performance, JNI into native code that calls your standard library's strstr
function. Well-implemented strstr
, like the one in recent versions of glibc, has linear worst-case running time and constant worst-case space usage. I believe glibc's strstr
can do Boyer-Moore-like long jumps through the text, too. C standard libraries are maintained by people who know how to write and maintain good and general-purpose libraries and practise their craft. The same cannot be said for the Java standard class library.
You will have to turn a Java UTF-16 string into something suitable for strstr
, such as a UTF-8 string. You will also have to handle embedded zero bytes in the UTF-8 string gracefully. Other than that, you will reap the benefits of a well-written and well-maintained library.
Java does regex searches (for this particular case) using a Boyer-Moore string search hacked into a naive regex implementation. Compiling a Pattern
with just your string will result in a Matcher
that performs relatively well. Note, however, that this does NOT extend to anything beyond string searching with the regex library; you're still stuck with a naive regex implementation that backtracks and all if you feed it a nontrivial regular expression.
As evidence for why you shouldn't use Java regex for actual regexes, I present you the following:
public class regex {
public static void main(String[] args) throws Exception {
String haystack = "ab";
String needle = "abab?.*";
for (int i = 0; i < 7; i++) haystack = haystack + haystack;
for (int i = 0; i < 4; i++) needle = needle + needle;
System.out.println(haystack.length() + " " + needle.length());
long before = System.currentTimeMillis();
System.out.println(Pattern.matches(needle, haystack));
long after = System.currentTimeMillis(); // long after indeed...
System.out.println(after - before);
}
}
This is a search in a 256-character haystack for a needle regex (that's an honest regex that you learnt about in compilers class) of 112 characters. It takes about 24 seconds to complete on my machine.
Upvotes: 1