Vitas
Vitas

Reputation: 253

Similar substring fast search

I need to find a substring SIMILAR to a given pattern in a huge string. The source huge string could be up to 100 Mb in length. The pattern is rather short (10-100 chars). The problem is that I need to find not only exact substrings but also similar substrings that differ from the pattern in several chars (the maximum allowed error count is provided as a parameter).

Is there any idea how to speed up the algorithm?

Upvotes: 3

Views: 2067

Answers (3)

Ahmed Salama
Ahmed Salama

Reputation: 11

1)There are many Algorithms related to the string searching. One of them is the famous Knuth–Morris–Pratt Algorithm.

2) You may also want to check Regular Expressions ("Regex") in whatever the language you're using. They will sure help you finding substrings 'similar' to the original ones.

i.e. [Java]

String pat = "Home";
String source = "IgotanewHwme";

for(int i = 0; i < pat.length(); i++){
    //split around i .. not including char i itself .. instead, replace it with [a-zA-Z] and match using this new pattern.
    String new_pat = "("+pat.substring(0, i)+")"+ "[a-zA-Z]" + "("+pat.substring(i+1, pat.length())+")";
    System.out.println(new_pat);
    System.out.println(source.matches("[a-zA-Z]*"+new_pat+"[a-zA-Z]*"));
}

and I think it's easy to make it accept any number of error counts.

Upvotes: 1

ThibThib
ThibThib

Reputation: 8210

You can have a look at the Levenshtein distance, the Needleman–Wunsch algorithm and the Damerau–Levenshtein distance

They give you metrics evaluating the amount of differences between two strings (ie the numbers of addition, deletion, substitution, etc.). They are often used to measure the variation between DNA.

You will easily find implementations in various languages.

Upvotes: 0

Asgeir
Asgeir

Reputation: 1112

Sounds like you want Fuzzy/Approximate String Matching. Have a look at the Wikipedia page and see if you can't find an algorithm that suits your needs.

Upvotes: 0

Related Questions