Reputation: 392
I'm trying to find a data structure (and algorithm) that would allow me to index an entire text document and search for any substring of it, no matter the size of the substring. The data structure should be stored on disk, either during or at the end of the indexing procedure.
For instance, given the following sentence:
The book is on the table
The algorithm should quickly (O(log(n))) find the occurrences of any substring of the text.
For instance, if the input is "book", it should find all occurrences of it, but this should also be true for "book is" and "The book is".
Unfortunately, most solutions work by tokenizing the text and searching on individual tokens. Ordinary databases also index text without supporting substring search (is that why a query such as SELECT ... LIKE '%foo%' is done with a linear scan and takes a long time?).
I could try to develop something from scratch (maybe a variation of an inverted index?) but I'd love to discover that somebody has already done that.
The most similar thing I found is SQLite3 Full-text search.
Thanks!
Upvotes: 3
Views: 2887
Reputation: 11
If you want fast substring search in a large text, use a suffix array or a suffix tree.
A suffix tree is a compressed trie of all suffixes of the text. It answers whether a substring occurs in O(m) time, where m is the length of the substring being searched. It is faster than a suffix array but more memory-intensive.
A suffix array is an array of integers giving the starting positions of all suffixes of the text, sorted in lexicographical order. A plain binary search over it takes O(m log n) time, which improves to O(m + log n) with additional longest-common-prefix information. It is more space-efficient than a suffix tree and, being a flat array, can be stored on disk, which makes it scalable for large texts.
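A minimal sketch of the suffix array idea in Python may make this concrete. The function names `build_suffix_array` and `find_occurrences` are hypothetical, and the construction is deliberately naive (it sorts full suffixes), so it only illustrates the lookup and is not suitable for large documents. The search is the plain binary search variant, without any longest-common-prefix acceleration.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes of text, sorted lexicographically.

    Naive construction for illustration only: comparing whole suffixes costs
    roughly O(n^2 log n) in the worst case.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return every start position of pattern in text using binary search on sa."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with pattern.
    return sorted(sa[start:lo])


text = "The book is on the table"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "book"))
print(find_occurrences(text, sa, "The book is"))
```

Because the index is just a list of integers, one plausible way to keep it on disk is to write it as fixed-width integers and binary-search it with file seeks, reading the text at the referenced offsets.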
Upvotes: -1
Reputation: 178511
One approach is to index your document in a suffix tree. Every substring of the document is a prefix of some suffix.
With this approach, all you have to do is build the suffix tree and, when querying a substring s, follow the nodes in the tree. If you can follow the entire query string, there is a suffix whose prefix is the query string, and thus the query is a substring of the document.
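The "follow the nodes" lookup can be sketched as below. Note the assumptions: this is an uncompressed suffix trie, not a true suffix tree, so it needs O(n^2) space and is only meant to show the traversal. The names `SuffixTrieNode`, `build_suffix_trie` and `find_in_trie` are hypothetical, and storing start positions on every node is an addition made here so that occurrences can be reported.

```python
from typing import Dict, List


class SuffixTrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "SuffixTrieNode"] = {}
        # Start positions of every suffix that passes through this node.
        self.positions: List[int] = []


def build_suffix_trie(text: str) -> SuffixTrieNode:
    """Insert every suffix of text into a character trie (quadratic space)."""
    root = SuffixTrieNode()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            if ch not in node.children:
                node.children[ch] = SuffixTrieNode()
            node = node.children[ch]
            node.positions.append(start)
    return root


def find_in_trie(root: SuffixTrieNode, query: str) -> List[int]:
    """Follow query character by character; expects a non-empty query."""
    node = root
    for ch in query:
        child = node.children.get(ch)
        if child is None:
            return []  # fell off the trie: query is not a substring
        node = child
    return list(node.positions)


root = build_suffix_trie("The book is on the table")
print(find_in_trie(root, "book is"))
print(find_in_trie(root, "books"))
```

A real suffix tree compresses chains of single-child nodes into one edge, which brings the space down to linear in the text length while keeping the same traversal logic.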
If you are querying only complete words, an inverted index could be enough. An inverted index usually maps a term (word) to the list of documents it appears in. In your case it would instead map each term to its positions in the document.
Upon query, for each occurrence of word i of the query you need to find its position (call it p), and check whether term i+1 of the query appears at position p+1.
This can be done quite efficiently, similarly to how an inverted index traditionally handles AND queries, but instead of looking for all terms in the same document, you look for the terms at consecutive positions.
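The positional lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions: positions are word offsets rather than character offsets, tokens are lowercased and split with a simple `\w+` regex, and the names `build_positional_index` and `phrase_search` are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase word tokens; punctuation is ignored.
    return re.findall(r"\w+", text.lower())


def build_positional_index(text: str) -> Dict[str, List[int]]:
    """Map each word to the sorted list of word positions where it occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for position, word in enumerate(tokenize(text)):
        index[word].append(position)
    return dict(index)


def phrase_search(index: Dict[str, List[int]], phrase: str) -> List[int]:
    """Return the word positions where the whole phrase starts."""
    words = tokenize(phrase)
    if not words:
        return []
    # Start from every position p of the first word...
    candidates = set(index.get(words[0], []))
    # ...and keep p only if word i of the phrase occurs at position p + i.
    for offset, word in enumerate(words[1:], start=1):
        positions = set(index.get(word, []))
        candidates = {p for p in candidates if p + offset in positions}
    return sorted(candidates)


index = build_positional_index("The book is on the table")
print(phrase_search(index, "book is"))
print(phrase_search(index, "the"))
```

Unlike the suffix-based structures, this only matches whole words, so a query such as "boo" would find nothing.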
Upvotes: 4