Yrlec
Yrlec

Reputation: 3458

Finding if a string contains any string in a collection

I'm trying to improve the performance of a Java-function I have that is determining whether a given search string contains >0 of the strings in a collection. This could look like premature optimization but the function is called A LOT so any speed up would be very beneficial.

The code currently looks like this:

public static boolean containsAny(String searchString, List<String> searchCollection) {
    int size = searchCollection.size();
    for (int i = 0; i < size; i++) {
        String stringInCollection = searchCollection.get(i);
        if (!Util.isNullOrEmpty(stringInCollection)) {
            // This is a performance optimization of contains.
            if (searchString.indexOf(stringInCollection, 0) > -1) {
                return true;
            }
        }
    }
    return false;
}

The list typically has about 30 elements and the same collection is reused a lot between each call.

The code above is a pretty straightforward linear search. I don't think it can be significantly improved unless we change the data structure to make it better than O(n). Are there any data structures that would enable me to do that?

Upvotes: 18

Views: 4912

Answers (11)

Ted Bigham
Ted Bigham

Reputation: 4341

As many other people have answered, there are better data structures in general for storing and searching strings. The problem in your case is that your list only has 30 entries. The overhead added by using a more complex data structure and algorithm could easily outweigh the gain you would get from it.

Don't get me wrong, your bottleneck is the indexOf line. It looks like it accounts for 95% of the processing. But if other data structures don't help (I tried an off-the-shelf Aho-Corasick Trie and it was twice as slow), here's a something to check...

The comment about using indexOf instead of contains is questionable. In my tests. I saw around 1.5 million lookups per second with "contains" and only about 700K with indexOf. If you have the same results, that will double your speed right there.

Change

// This is a performance optimization of contains.
if (searchString.indexOf(stringInCollection, 0) > -1) {

[back] to

if (searchString.contains(stringInCollection)) {

If you're interested, the trie I tested with is here: http://ahocorasick.org/ and the code is quite simple. The problem I saw is that is doesn't have a feature for early-out after finding the first match. It parses the whole string and finds all matches. It was faster than indexOf() for cases where there were no matches (830K/sec) but still slower than contains().

Apparently http://ahocorasick.org/ is gone.

Very similar code (possibly the same) can be found at https://github.com/robert-bor/aho-corasick

Upvotes: 2

Manjunath
Manjunath

Reputation: 1685

@Yrlec from your comment that the searchCollection can be thought of as constant with not much of modifications , you can sort the arraylist and cache it or you can implement custom List class which stores a reference to the sorted elements which are added to it.

The reason for this is if you have your searchCollection sorted then you can use compareTo method of String and reduce the number of iterations thus enhancing your method performance to an extent.

public static boolean containsAny(String searchString, List<String> searchCollectionSorted) {
    int size = searchCollectionSorted.size();
    for (int i = 0; i < size; i++) {
        String stringInCollection = searchCollectionSorted.get(i);
        if (!Util.isNullOrEmpty(stringInCollection)) {
            if (stringInCollection.compareToIgnoreCase(searchString) > 0) {
                if (searchString.startsWith(stringInCollection) {
                    return true;
                } else {
                    // No point of iterating if we reach here as the searchstring is greater and hence iterations are saved improving performance
                    break;
                }
            }
        }
    }
    return false;
}

Upvotes: 1

kraskevich
kraskevich

Reputation: 18556

It is possible to speed it up significantly with Aho-Corasick algorithm.

You can build an Aho-Corasick automaton for a collection using O(total length of all strings in a collections) time and space. Then it will be possible to check if one of the strings in a collection is a substring of a given string S in O(S.length) time by traversing this automaton.

Upvotes: 16

Gladwin Burboz
Gladwin Burboz

Reputation: 3549

This is a CPU intensive operation and not long running or blocked on I/O. If you are using Java 8 you can use parallel streams to do processing in parallel as shown below. The method has been changed to use Collection instead of List to keep it more flexible.

public static boolean containsAny(final String searchString,
        final Collection<String> searchCollection) {
    return searchCollection.stream().parallel()
            .anyMatch(x -> searchString.indexOf(x) > -1);
}

Furthermore, instead of using List, a Set should be used as the underlying data structure so that duplicate entries, if any, will be eliminated.

Upvotes: 8

Ryan
Ryan

Reputation: 2091

You can complete your search in approximately 2/3 the time by using the Aho Corasick algorithm.

The accepted answer from @user2040251 among others (myself included) suggested the Aho Corasick algorithm.

From your comments I can see that you aren't looking for a general solution but a solution that performs well in a particular use-case.

@Vlad created a possible test suite to benchmark some proposed solutions.

Tests performed by @Marco13 of the Java implementation at http://ahocorasick.org/ indicate that your initial implementation was faster.

Your comments have provided significant additional details about the problem you are trying to solve:

  • Approximately 30 strings to search for
  • The strings to look for are 10 - 40 characters long.
  • The string to search in is usually about 100 characters long.
  • The string you are searching is a file path.

I made a couple of quick modifications to @Vlad's gist to better match the specifics of the problem you described.

I'd previously commented that the Aho-Corasick implementation others had tested was finding all potential matches. A method that returned once the first match was found should be much faster. To see if my intuition was correct I created a branch of Robert Bor's java Aho-Corasick implementation. This branch has now been merged into Aho-Corasick!

  • Completed 100000 containsAny in 4337 ms (avg 0 ms)
  • Completed 100000 containsAnyWithRegex in 41153 ms (avg 0 ms)
  • Completed 100000 containsAnyWithOffset in 23624 ms (avg 0 ms)
  • Completed 100000 containsAnyAhoCorasickDotOrg in 7956 ms (avg 0 ms)
  • Completed 100000 containsAnyAhoCorasickDotOrgMatches in 5351 ms (avg 0 ms)
  • Completed 100000 containsAnyAhoCorasickDYoo in 2948 ms (avg 0 ms)
  • Completed 100000 containsAnyHospool in 7052 ms (avg 0 ms)
  • Completed 100000 containsAnyRaita in 5397 ms (avg 0 ms)
  • Completed 100000 containsAnyJava8StreamParallel in 8285 ms (avg 0 ms)

I also implemented a method that performed each search in its own thread. That implementation was horrible and performed approximately 10x slower.

Update: Since my initial testing I've come across An even faster Aho-Corasick implementation.

I included a benchmark on the Java 8 parallel stream implementation suggested by @GladwinB as well as two com.eaio.stringsearch implementations.

There may still be gains to be had. This paper, for example, describes an set matching variation of Aho-Corasick that sounds appropriate for your problem.Towards Faster String Matching for Intrusion Detection

Upvotes: 3

ig-melnyk
ig-melnyk

Reputation: 2879

TreeSet,HashSet or PrefixTree are quite good solutions. You should prefer PrefixTree if you 'll need to search whether a given prefix exists in the collection(complexity O(length(S)),otherwise use HashSet. http://docs.oracle.com/javase/7/docs/api/java/util/HashSet.html

Upvotes: 0

Vlad Muresan
Vlad Muresan

Reputation: 49

You can use the HashSet data structure. But a hash set will not allow duplicates. For example, you cannot have the string "foo" twice in the HashSet.

On the plus side, the complexity should be O(1).

http://docs.oracle.com/javase/7/docs/api/java/util/HashSet.html

Upvotes: 1

Nikola Dimitrovski
Nikola Dimitrovski

Reputation: 708

Can you try with this solution:

    final String[] searchList = searchCollection.toArray(new String[0]);
    Arrays.sort(searchList, new Comparator<String>() {
        @Override
        public int compare(final String o1, final String o2) {
            if (o1 == null && o2 == null) {
                return 0;
            }
            if (o1 == null || o1.isEmpty()) {
                return 1;
            }
            if (o2 == null || o2.isEmpty()) {
                return -1;
            }
            return o1.compareTo(o2);
        }
    });
    final int result = Arrays.binarySearch(searchList, searchString);
    return result >= 0 ? true : false;

Upvotes: 2

Joop Eggen
Joop Eggen

Reputation: 109557

// Make a regex pattern (once only):
StringBuilder pattern = new StringBuilder();
for (String sought : searchCollection) {
    if (!Util.isNullOrEmpty(sought)) {
        if (pattern.length() != 0) {
            pattern.append('|');
        }
        pattern.append(Pattern.quote(sought));
    }
}
final Pattern PATTERN = Pattern.compile("(" + pattern + ")");

This creates a pattern of alternatives like "(abc|def|ghi)". You might consider a case-insensitive search.

And in the function containsAny:

Matcher m = PATTERN.matcher(searchString);
return m.find();

Regex compilation is relatively smart. It would be comparable to using a search tree of your collection of sought words: "agent" and "agitator" to ("ag", ("ent", "itator"))

Upvotes: 9

ethanfar
ethanfar

Reputation: 3778

I believe the best suited data structure for this is a Suffix Tree. For a string of size n, building the tree takes Theta(n), and searching a sub-string of length m in it, takes O(m).

This is one of those data structures that are extremely well suited (and intended) for searching strings. It's a very common data structure with many implementations online.

Upvotes: 3

Vlad
Vlad

Reputation: 1763

Compare with this a sort of inverted and optimised version:

  public static boolean containsAny(String searchString, List<String> searchCollection) {
    for (int offset = 0; offset < searchString.length(); offset++) {
      for (String sought: searchCollection) {
        int remainder = searchString.length() - offset;
        if (remainder >= sought.length && searchString.startsWith(sought, offset)) {
          return true;
        }
      }
    }
    return false;
  }

Note the usage of startsWith with offset.

Upvotes: 2

Related Questions