Reputation: 30166

Fast algorithm for searching for substrings in a string

I'd like an efficient algorithm (or library) that I can use in Java to search for substrings in a string.

What I would like to do is:

Given an input string - INSTR:

"BCDEFGH"

And a set of candidate strings - CAND:

"AB", "CDE", "FG", "H", "IJ"

Find any CAND strings that match as substrings within INSTR

In this example I would match "CDE", "FG", and "H" (but not "AB" and "IJ")

There could be many thousand candidate strings (in CAND), but more importantly I will be doing this search many millions of times so I need it to be FAST.

I'd like to work with char arrays. Also, I'm not intested in architectural solutions, like distributing the search - just the most efficient function/algorithm for doing it locally.

Additionally, all the strings in CAND and INSTR will all be relatively small (< 50 chars) - i.e. the target string INSTR is NOT long relative to the candidate strings.

Update I should have mentioned, the set of CAND strings is invariant across all values of INSTR.

Update I only need to know that there was a match - and i don't need to know what the match was.

Final Update I opted to try AhoCorsick and Rabin-Karp, due to simplicity of implementation. Because I have variable length patterns I used a modified Rabin-Karp that hashes the first n characters of each pattern, where n is the length of the smallest pattern, N was then the length of my rolling substring search window. For the Aho Corsick I used this

In my test i searched for 1000 patterns in two documents news paper articles, averaged across 1000 iterations etc... Normalised times to complete were:

AhoCorsick: 1

RabinKarp: 1.8

Naive Search (check each pattern & use string.contains): 50

*Some resources describing the algos mentioned in the answers below:

http://www.seas.gwu.edu/~simhaweb/cs151/lectures/module5/module5.html

http://www.cs.princeton.edu/courses/archive/spr09/cos226/lectures/18SubstringSearch-2x2.pdf

http://www-igm.univ-mlv.fr/~lecroq/string/index.html*

Upvotes: 32

Answers (10)

Daniel Brückner

Reputation: 59705

Read up on the Aho-Corasick algorithm and the Rabin-Karp algorithm.

If the input is not too large, you don't want to repeat the search many times and you do not have many patterns, it might be a good idea to use a single pattern algorithm several times. The Wikipedia article on search algorithms gives many algorithms with running and preprocessing times.

Implementations:

Presentations:

http://www.slideshare.net/taka111/ahocorasick-string-matching-algorithm-15078438

Upvotes: 26

Deepak Kumar

Reputation: 1

import java.util.Scanner;

public class StringMatch 
{
    static int temp,i=0,j=0; static boolean flag=true,matcher=false;

    static String str=null,mstr=null;static char astr[],amstr[];

    static void getter(){
        Scanner sc = new Scanner(System.in);
        str = sc.nextLine();
        //String str="today is Monday"; 
        astr=str.toCharArray();
         mstr = sc.nextLine();
        //String mstr="is"; 
         amstr=mstr.toCharArray();
    }

    static void stringMatch(){
        while(i<astr.length){
            if(astr[i]==amstr[j]){
            while((j!=amstr.length)&&flag){temp=i;
                if(astr[i]!=amstr[j]) {flag=false;matcher=false;}
                else{matcher=true;}
                i++;j++;
                //System.out.println(i+"\t"+j);
            }if(matcher==true)break;i=temp;}i++;j=0;flag=true;

        }
        if(matcher==true) {System.out.println("true");}
        else    {System.out.println("false");}
    }

    public static void main(String[] args) {

    StringMatch.getter();
    StringMatch.stringMatch();

    }
}

Upvotes: 0

netcyrax

Reputation: 1099

Here are some implementation of fast String search algorithms in Java.

Upvotes: 0

Joy Dutta

Reputation: 3426

We can take advantage of the small size (< 50 char) of the strings to build a super fast algo for this case, at the cost of memory.

We can hash all possible substrings of INSTR in a hash one time that will cost O(n^2) time. Then regardless of the number of CAND strings, the lookup will be O(1). Worth it for a very large number of CAND strings.

If INSTR is large, then we can build a suffix array and not sort it, so that the top item is the longest (=N) and bottom item is the last char of INSTR. Now for each CAND string, only search from the top as long as length(CAND) <= length(suffix). Each of those comparisons will be O(n).

Upvotes: 2

Nick Dandoulakis

Reputation: 43158

Another solution is to use a suffix array for the INSTR.
Since the INSTR is small you can sort it with bubble sort.

Afterwards you can search for a specific CAND string in O(logN) time,
where N = length(suffix_array) = length(INSTR).

Upvotes: 0

Jørgen Fogh

Reputation: 7656

This is what regular expressions are for. As noted above, finite state automata are what you need, but that is exactly how a standard regexp-matcher is implemented.

In java you could write something like:

StringBuilder sb = new StringBuilder();
bool first = true;
for (String subStr : substrings) {
    if (first)
        first = false;
    else
        sb.append('|');
    sb.append(escape(subStr));
}
Pattern p = Pattern.compile(sb.toString());

the method escape should escape any characters which have special meanings in a regexp.

Upvotes: 5

spoulson

Reputation: 21601

Also check the Boyer-Moore algorithm for single-string pattern matching.

Upvotes: 2

emptyset

Reputation: 3244

Rabin-Karp multiple pattern search appears to be the fastest.

Upvotes: 4

Avi

Reputation: 20152

You might want to look into Aho-Corasick algorithm and related algorithms. I don't know of any libraries that implement this, offhand, but this is the classic way of solving this problem.

Upvotes: 2

Antti Huima

Reputation: 25542

Convert the set of candidate strings into a deterministic finite state automaton and then run through the input string in linear time. Converting a single string into a DFS is well-covered in the standard books. You can convert a set of strings by first constructing a non-deterministic automaton and then determinizing it. That can create exponential blow-up in the worst case in the size of the automaton but the search afterwards is fast; especially if the target string is long and the candidates short that's going to work well.

Upvotes: 11

Fast algorithm for searching for substrings in a string

Answers (10)

Related Questions