richselian
richselian

Reputation: 731

Preprocess a constant set of strings for binary searching

I have a sorted list of several strings (size=K < 1000). I need to find insertion positions for billions (size=N) of strings in the sorted list. The list is kept constant and strings are inserted into children nodes.

The question is: I currently use binary search, which time cost is O(strlen * NlogK). But since the sorted list is constant. I wonder whether there is a preprocessing method on the small sorted list to make searching faster than logK?

Upvotes: 1

Views: 123

Answers (2)

Effect
Effect

Reputation: 85

In addition to Chris Okasaki I may suggest you calculate for each tree node (trie or patricia) number of leafs in corresponding subtree (you can easaly do it with depth first traversal).

To make a query with a string you go down by the tree and sum number of leafs (that are precalculated) you leave in subtrees that are left from your current position. When you stop at position and you cann't continue tree path without conflict with query string it means you find a position of this string. Index is a number of all left-placed leaves that was calculated with sum.

Upvotes: 1

Chris Okasaki
Chris Okasaki

Reputation: 4955

Some good alternatives include a Trie (perhaps implemented as a Patricia trie or a ternary search tree), or a perfect hash table.

EDIT: To find the "insertion position" for a non-matching string using a trie, first label each complete string with its position (you can do this when you originally build the trie). When searching for a non-matching string, you'll detect this at the first index within the string that has no match.

For example, suppose you were looking for the string CAR in a trie that included CANNOT and CATASTROPHE (and nothing else relevant). You would detect this mismatch at the R because where would be no R child below the A. But then it should be easy to tell that the surrounding letters in that position are N and T. Going to the N and then proceeding down and right would take you to CANNOT where you could read off the position. Or, going to the T and then proceeding down and left would take to you CATASTROPHE.

Upvotes: 2

Related Questions