Reputation: 155205

Substring indexing many similar strings

I have a large collection of data, in this case imagine a 80,000+ array of String all containing file paths.

Being filepaths it means large groups of them start off with the same path, e.g. I over 50,000 of the files start off with "/dataset1/subsetAA/childX/".

I want to allow freetext searching of these paths. Right now I do that with a simple predicate that looks like this:

foreach(String term in terms)
    if( path.IndexOf( term, StringComparison.OrdinalIgnoreCase ) == -1 )
        return false;
return true;

I do save search results as they're typed in, so the more you type in the quicker it gets, however the initial few searches (e.g. for "f" > "fo" > "foo") can take up to 3 or 4 seconds on even a fast machine.

I'd like to build a substring index up that eliminates my need to use IndexOf, and preferably one that takes advantage of common paths to reduce index size, I don't want to consume too much memory.

Upvotes: 1

Answers (2)

Sergey Kalinichenko

Reputation: 726589

however the initial few searches (e.g. for "f" > "fo" > "foo") can take up to 3 or 4 seconds on even a fast machine.

That's the only thing that you need to optimize then. Create a very simple structure that consists of three hash sets - for single characters, for two characters, and for three characters. Each element of the one-character hash index would contain a list of elements that include the indexed character; each element of the two-character hash index would contain a list of elements that include the indexed pair of characters; three-character index would do likewise.

When the initial portion of the search is typed, look up using indexes. For example, when f is typed, you would grab the list of items containing f from the first hash table. As the user continues typing, you'd grab items from the second index for the "fo" key, and then from the third index for the "foo" key.

As soon as you get four characters or more, you go back to the searches based on IndexOf, using the last three characters of the search term to look up the initial list in the hash based on three-character substrings. The number of items that you get from the list would be relatively small, so the searches should go much faster.

Another optimization should be stopping your search as soon as you've got enough items to display to your user. For example, if the user types "tas" (from "dataset") your three-character index would give you 50000 hits. Grab the first 20 (or as many as you need to display), and skip the remaining ones: the users will refine their search shortly, so the additional items will likely be discarded shortly anyway.

Upvotes: 2

Patashu

Reputation: 21773

read about the data structure known as a Trie: http://en.wikipedia.org/wiki/Trie

It does exactly what you want, it takes many strings and builds a tree out of common prefixes, with strings, each leaf being a string that follows the series of prefixes in its parents (that you can build by concatenating all of its parents to what's in the leaf, to save space)

Upvotes: 2

Substring indexing many similar strings

Answers (2)

Related Questions