How to implement floor() and ceil() operations for ternary search tries efficiently?

Question

I have been studying TST's (ternary search tries) from Robert Sedgewick's book, here's a link to his implementation: http://algs4.cs.princeton.edu/52trie/TST.java.html

So, since TST's are a modification of BST's, I wonder how to efficiently implement floor and ceiling operations. (They are not implemented anywhere in his code). All the approaches I have thought about are messy and not very efficient.

templatetypedef · Accepted Answer

Yes, you can efficiently implement these operations on a TST.

It might help for a minute to think of a TST as just a plain trie. We can work out how to perform predecessor and successor searches (what you're calling floor and ceiling) in a trie, then adapt those approaches to work in a TST. For simplicity I'm just going to talk about successor searches, though this can easily be adapted for precedessor searches as well.

Suppose you want to find the lexicographically first word that comes no later than some word w. Start by searching the trie for the word w. If you find that w is a word in the trie, you're done.

Otherwise, a few things can happen. First, you might find that you end up at some node in the trie corresponding to w that isn't a word. In that case, you know that w is a prefix of some word in the trie, so to find the successor you need to find the lexicographically first string that has w as a prefix. To do that, keep walking down the trie, always going as far to the left as possible, until you eventually hit a node that corresponds to a word.

Second, you might fall off the trie when trying to search for w. In that case, you'll have read some prefix of w along your path. In that case, you must have ended at some node where you were trying to read a character c, but there was no edge labeled c. In that case, look at the other edges from this node and find the first one with a character that comes after c. If one exists, take it, then find the lexicographically first word in that subtrie by always going as far to the left as possible. If not, back up one node in the trie and repeat this process.

To summarize, the recursive algorithm looks something like this:

function findSuccessor(root, remainingChars) {
    /* If we walked off the trie, we need to back up. Return null
     * to signal an error.
     */
    if (root == null) return null;

    /* If we're on the trie and out of characters, we're either done
     * or we need to find the cheapest way to extend this path.
     */
    if (remainingChars == "") {
        if (root is a word) {
            return root;
        } else {
            return goLeftUntilYouFindAWord(root);
        }
    }

    /* Otherwise, keep walking down the trie. */
    let nextLetter = remainingChars[0];

    /* If there is a child for this letter, follow it and see
     * what happens.
     */
    if (root.hasChildFor(nextLetter)) {
        let result = findSuccessor(root.child(nextLetter), nextLetter.substring(1));

        /* If we found something, great! We're done. */
        if (result != null) return result;
    }

    /* If we're here, we either (a) have no transition on this
     * character or (b) we do, but the successor isn't there. In
     * either case, figure out which child we have that comes right
     * after nextLetter and go down there if possible.
     */
    char letterAfter = node.firstChildAfter(nextLetter);

    /* If no such child exists, there is no successor in this
     * subtrie. Report failure.
     */
    if (letterAfter == null) return null;

    /* Otherwise, get the first word in that subtrie. */
    return goLeftUntilYouFindAWord(node.child(letterAfter));
}

So how exactly does this translate into the TST case? Well, we need to be able to check if a child exists - that's something we can do with a regular BST lookup - and we also need to be able to find the first character that comes after the character at a particular level - which we can do with a successor search in the BST. We also need to be able to find the first word in a subtree, which we can do by always walking to the left in the BST of child pointers.

Overall, the runtime here will be O(L log |Σ|), where L is the length of the longest string in the trie and Σ is the set of allowed characters. The reason for this is that in the worst case we have to descend all the way down the TST to find the successor, and each time we do so we do a constant number of BST operations, each of which takes time O(log |Σ|) because there are at most |Σ| child pointers for each node.

If you'd like to see a concrete implementation of this, I have a C++ implementation of a TST that implements lower_bound and upper_bound, which are closely related to the operations you're describing.

How to implement floor() and ceil() operations for ternary search tries efficiently?

Answers (1)

Related Questions