Tamojit Chatterjee
Tamojit Chatterjee

Reputation: 161

data structure for probable string matches

what is the best data structure for the following operation :
the data structure stores a list of words
input : a string say we name it 'pre'
output : a list of all strings that have the pre as their prefix(from the list of stored words) and the words in the list should be ordered in decreasing order of their priority.
priority of a particular string is increased if it is used out of the list of strings returned as output.
i will be using this for word prediction so every time the user selects a certain word(from the list of returned words) it's priority is increased by 1.
I have already implemented a trie but it gives the output(list) in alphabetical order i want it sorted by priority.

Upvotes: 3

Views: 616

Answers (4)

Shriganesh Shintre
Shriganesh Shintre

Reputation: 2498

This is memory inefficient solution.

Reference: Trie example

At each node store all the possible valid suffixes which have prefix as so far traversed parents. Also for fast retrieval use max heap based on priority to store these suffixes.

For example your c++ trie node will look like this (I have not tested the code)

typedef pair<int, string> SUFFIX;
class Compare {
public:
    bool operator() (SUFFIX &d1, SUFFIX &d2) {
        return d1.first < d2.first;
    }
};
typedef priority_queue<SUFFIX, vector<SUFFIX>, Compare> max_heap;

struct TrieNode {
    char data; // char at current node
    max_heap word_suffixes; 
    bool is_complete;
};

/* Note: max_heap word_suffixes basically hold all strings without prefix so far. 
For example: You have dictionary of egg, eye at the starting node "e" your max
heap will have two entries "gg" and "ye" (with highest priority say "gg" 
as root of max heap) 
*/

Now the time complexity will be

1) traverse the trie as per prefix "pre" O(L) (L=len of pre)

2) at the node pop each string from max heap, which will give you list sorted with priority. O(nlogn) (n=size of heap)

3) reconstruct the heap after incrementing the priority of used word. O(nlogn)

Note: You can also try storing the suffixes as BST with priority as key. In-order traversal will give you priority-sorted list of suffixes (O(n)). Used word's priority can be incremented by removing the suffix from BST and re-adding with new priority (Insert/search/delete will be O(height) for balanced BST).

Upvotes: 0

user2566092
user2566092

Reputation: 4661

As other answers indicated, you can use a trie to quickly get all words that have a given prefix, and then sort the words according to their priority. Ignoring the access time to get the matching words from the trie, if you get k matching words then sorting according to priority takes O(k log k) time. This is so close to the theoretically optimal O(k) time that you probably wouldn't want to bother trying to improve upon it for practical applications, especially since printing the k words after sorting actually has running time O(kl) where l is he average length of the matching words, and the multiplier of l will probably usually be roughly on the same order as log k. However, if you are willing to multiply the amount of space you use by O(L_avg) where L_avg is the average length of all your words, then you can get the time for accessing words in sorted order and updating a priority +1 down to O(k + L log n) where L is the length of the chosen word that gets priority +1, and n is your total number of words.

The idea may sound a little crazy at first but bear with me, the memory really only gets multiplied by O(L_avg) as I'll explain. The idea is that at each node of your trie, you store all the words that have the corresponding prefix, along with their priorities, in a self-balancing binary search tree (ordered according to priority). You can represent words as indexes into an array storing your words, instead of the full words, so the storage requirement at each node of your trie is linear in the number of words that have the corresponding prefix. When a word gets priority +1, you have to go up through the trie and update the balanced binary search trees for the corresponding trie node for the word and all its parent nodes, which takes O(L log n) time. However to get the indices of the words in sorted order in response to a query, all you have to do is traverse your binary tree in pre-order, which takes O(k) time. Now about the storage. A word of length L is stored in L binary trees: the tree at the trie node for your word, and also the trees for all its L-1 parent nodes. So if you add up the total storage for all trees at all nodes in the trie, by counting in terms of how many times each word occurs in your trees, the total tree storage is linear in the total length of all your words, which is O(n L_avg). If you can handle that multiplier on storage, then I believe this is the theoretically fastest way to handle the queries and priority changes, if you really want to remove the log k multiplier you get by sorting query results.

Upvotes: 0

gravitas
gravitas

Reputation: 703

This is probably not the best solution but maybe it will give you some ideas.

Use a trie to store all words, and have your nodes include a priority field.

Have a list data-structure of some sort that your trie can see that has the same scope as your querying function. The list will contain (word, priority) entries.

Iterate the tree beneath the input word (so the subtree underneath 'pre') and look for all words (presumably the nodes have a boolean 'word' field or something). When a word is found (word == 1), you will then add the (word, priority) to the end of the list.

Suppose i is the position of the new entry, then compare list(i) with list(i - 1). If the priority of list(i - 1) is less than list(i) then you switch their positions. Keep doing this until the i-1th item has equal or greater priority than the newly added item.

Once the trie-searching function completes, you will have a list with (word, priority) entries in decreasing order.

I hope this made sense.

Upvotes: 1

Sabashan Ragavan
Sabashan Ragavan

Reputation: 738

The best data structure for your problem would be a trie A trie would allow for fast lookups at the expense of space.

Follow this link for more information: link

Upvotes: 4

Related Questions