345871345

Reputation: 261

Search for (Very) Approximate Substrings in a Large Database

I am trying to search for long, approximate substrings in a large database. For example, a query might be a 1000-character substring that differs from its match by a Levenshtein distance of several hundred edits. I have heard that indexed q-grams could do this, but I don't know the implementation details. I have also heard that Lucene could do it, but is Lucene's Levenshtein algorithm fast enough for hundreds of edits? Perhaps there is something from the world of plagiarism detection? Any advice is appreciated.

Upvotes: 5

Views: 297

Answers (2)

Yuval F

Reputation: 20621

Lucene does not seem to be the right tool here. In addition to Mikos' fine suggestions, I have heard about AGREP, FASTA and Locality-Sensitive Hashing (LSH). I believe an efficient method should first prune the search space heavily, and only then apply more sophisticated scoring to the remaining candidates.
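
To illustrate the prune-then-score idea, here is a minimal sketch (not taken from any particular library): MinHash-style LSH over q-grams prunes the database down to a small candidate set, and plain Levenshtein is computed only on those survivors. The parameter values, the helper names, and the assumption that the database is already split into roughly query-sized chunks that fit in memory are all assumptions for the sketch.

```python
import random
from collections import defaultdict

Q = 8            # q-gram length (assumed; needs tuning)
NUM_HASHES = 64  # MinHash signature length
BANDS = 16       # LSH bands; 64 / 16 = 4 rows per band

random.seed(42)
SEEDS = [random.getrandbits(32) for _ in range(NUM_HASHES)]

def qgrams(text, q=Q):
    # Set of overlapping q-grams; falls back to the whole string if it is short.
    return {text[i:i + q] for i in range(max(len(text) - q + 1, 1))}

def minhash(grams):
    # One minimum per seed; sets with high Jaccard similarity agree on many minima.
    # Python's hash() is fine within one process; a persistent index needs a stable hash.
    return [min(hash((seed, g)) for g in grams) for seed in SEEDS]

def lsh_buckets(signature, bands=BANDS):
    rows = len(signature) // bands
    return [hash(tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]

def levenshtein(a, b):
    # Plain O(len(a) * len(b)) dynamic programming; run only on pruned candidates.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def build_index(chunks):
    index = defaultdict(set)  # (band, bucket) -> chunk ids
    for cid, chunk in enumerate(chunks):
        for band, bucket in enumerate(lsh_buckets(minhash(qgrams(chunk)))):
            index[(band, bucket)].add(cid)
    return index

def search(query, chunks, index, max_dist=300):
    candidates = set()
    for band, bucket in enumerate(lsh_buckets(minhash(qgrams(query)))):
        candidates |= index.get((band, bucket), set())
    # Expensive scoring only on the pruned candidate set.
    hits = [(levenshtein(query, chunks[c]), c) for c in candidates]
    return sorted(h for h in hits if h[0] <= max_dist)
```

With several hundred edits in a 1000-character query, the Jaccard similarity of the q-gram sets can be quite low, so q, the signature length, and the band/row split would all need experimentation against your data.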

Upvotes: 1

Mikos

Reputation: 8553

Q-grams could be one approach, but there are others, such as BLAST and BLASTP, which are used for protein and nucleotide matching, among other things.
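
As a rough illustration of the q-gram route (the chunking of the database, the parameter values, and the helper names below are assumptions for the sketch, not a prescribed implementation), one can use the classic count filter: a single edit destroys at most q overlapping q-grams, so any chunk within the edit budget must still share a minimum number of q-grams with the query, and everything else can be skipped before any expensive distance computation.

```python
from collections import Counter, defaultdict

Q = 4  # q-gram length (assumed; smaller q tolerates more edits)

def qgram_counts(text, q=Q):
    # Multiset of overlapping q-grams.
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))

def build_qgram_index(chunks, q=Q):
    index = defaultdict(list)  # q-gram -> [(chunk id, count in that chunk)]
    for cid, chunk in enumerate(chunks):
        for gram, count in qgram_counts(chunk, q).items():
            index[gram].append((cid, count))
    return index

def candidate_chunks(query, index, max_dist, q=Q):
    # Count filter: a match within max_dist edits shares at least
    # (len(query) - q + 1) - q * max_dist q-grams with the query.
    threshold = (len(query) - q + 1) - q * max_dist
    shared = Counter()
    for gram, qcount in qgram_counts(query, q).items():
        for cid, ccount in index.get(gram, ()):
            shared[cid] += min(qcount, ccount)
    return [cid for cid, count in shared.items() if count >= threshold]
```

One caveat for this question: with several hundred edits against a 1000-character query the threshold can drop to zero or below, at which point the filter prunes nothing, so the q-gram bound mainly helps when the edit budget is small relative to the query length.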

The Simmetrics library is a comprehensive collection of string distance approaches.

Upvotes: 1
