Reputation:
Hello fellow stackoverflowers!
I have a word list of 200,000 string entries, with an average string length of around 30 characters. These strings are the keys, and to each key I have attached a domain object. I would like to find the domain objects in this collection by knowing only a part of the key, e.g. the search string "kov" would match the key "stackoverflow".
Currently I am using a Ternary Search Tree (TST), which usually finds the items within 100 milliseconds. This is, however, too slow for my requirements. The TST implementation could be improved with some minor optimizations, and I could try to balance the tree, but I figured that these things would not give me the 5x - 10x speed improvement I am aiming at. I am assuming that the reason it is so slow is that I basically have to visit most nodes in the tree.
Any ideas on how to improve the speed of the algorithm? Are there any other algorithms that I should be looking at?
Thanks in advance, Oskar
Upvotes: 13
Views: 6059
Reputation: 233
To query a large set of text in an efficient manner you can use the concept of Edit Distance / Prefix Edit Distance.
Edit Distance ED(x, y): the minimal number of transformations needed to get from x to y.
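For illustration, here is the classic dynamic-programming formulation of ED as a minimal C# sketch (the method name is mine, and it assumes using System for Math):
static int EditDistance(string x, string y) {
    // d[i, j] = ED between the first i chars of x and the first j chars of y.
    var d = new int[x.Length + 1, y.Length + 1];
    for (int i = 0; i <= x.Length; ++i) d[i, 0] = i;   // delete all of x
    for (int j = 0; j <= y.Length; ++j) d[0, j] = j;   // insert all of y
    for (int i = 1; i <= x.Length; ++i)
        for (int j = 1; j <= y.Length; ++j)
            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1,                           // deletion
                         d[i, j - 1] + 1),                          // insertion
                d[i - 1, j - 1] + (x[i - 1] == y[j - 1] ? 0 : 1));  // substitution or match
    return d[x.Length, y.Length];
}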
But computing the ED between the query text and every term is resource- and time-consuming. Therefore, instead of calculating the ED for each term right away, we can first extract the possibly matching terms using a technique called a Qgram index, and then run the ED calculation on those selected terms only.
An advantage of the Qgram index technique is that it supports fuzzy search.
One possible approach to adopting a Qgram index is to build an inverted index using Qgrams. There we store all the words which contain a particular Qgram (instead of storing the full string you can use a unique ID for each string).
col : colmbia, colombo, gancola, tacolama
Then, when querying, we calculate the number of common Qgrams between the query text and the available terms.
Example: x = HILLARY, y = HILARI (query term)
Qgrams
$$HILLARY$$ -> $$H, $HI, HIL, ILL, LLA, LAR, ARY, RY$, Y$$
$$HILARI$$ -> $$H, $HI, HIL, ILA, LAR, ARI, RI$, I$$
number of q-grams in common = 4
For the terms with a high number of common Qgrams, we calculate the ED/PED against the query term and then suggest those terms to the end user.
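Roughly how such a count could be computed, as a small C# sketch (the class and method names are mine, not from the linked project; the '$' padding follows the example above):
using System.Collections.Generic;
using System.Linq;

static class QGramSketch {
    const int Q = 3;

    // "HILLARY" -> $$HILLARY$$ -> $$H, $HI, HIL, ILL, LLA, LAR, ARY, RY$, Y$$
    static IEnumerable<string> QGramsOf(string s) {
        string padded = "$$" + s + "$$";
        for (int i = 0; i + Q <= padded.Length; ++i)
            yield return padded.Substring(i, Q);
    }

    // Number of distinct q-grams the two strings share (4 for HILLARY/HILARI).
    public static int CommonQGrams(string term, string query) {
        var grams = new HashSet<string>(QGramsOf(term));
        grams.IntersectWith(QGramsOf(query));
        return grams.Count;
    }
}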
You can find an implementation of this theory in the following project. Feel free to ask any questions. https://github.com/Bhashitha-Gamage/City_Search
To study more about Edit Distance, Prefix Edit Distance and Qgram indexes, please watch the following video by Prof. Dr. Hannah Bast: https://www.youtube.com/embed/6pUg2wmGJRo (the lesson starts at 20:06)
Upvotes: 0
Reputation: 545628
/EDIT: A friend of mine just pointed out a stupid assumption in my construction of the q-gram table. The construction can be made much simpler – and consequently, much faster. I've edited the source code and explanation to reflect this. I think it might be the final solution.
Inspired by Rafał Dowgird's comment on my previous answer, I've updated my code. I think this merits its own answer, however, since it's also quite long. Instead of padding the existing strings, this code builds the index over the original array of strings. Instead of storing a single position, the suffix array stores a pair: the index of the target string and the position of the suffix in that string. In the result, only the first number is needed. However, the second number is necessary for the construction of the q-gram table.
The new version of the algorithm builds the q-gram table by walking over the suffix array instead of the original strings. This saves the binary search of the suffix array. Consequently, the runtime of the construction drops from O(n * log n) down to O(n) (where n is the size of the suffix array).
Notice that, like my first solution, the use of SubString results in a lot of unnecessary copies. The obvious solution is to write an extension method that creates a lightweight wrapper instead of copying the string. The comparison then has to be slightly adapted. This is left as an exercise for the reader. ;-)
using System;
using System.Collections.Generic;
using System.Linq;

using Position = System.Collections.Generic.KeyValuePair<int, int>;

class QGramIndex {
    private readonly int m_Q;
    private readonly IList<string> m_Data;
    private Position[] m_SA;                 // suffix array: (string index, suffix start)
    private Dictionary<string, int> m_Dir;   // q-gram -> position in the suffix array

    public QGramIndex(IList<string> strings, int q) {
        m_Q = q;
        m_Data = strings;
        MakeSuffixArray();
        MakeIndex();
    }

    public int this[string s] { get { return FindInIndex(s); } }

    private int FindInIndex(string s) {
        int idx;
        if (!m_Dir.TryGetValue(s, out idx))
            return -1;
        return m_SA[idx].Key;
    }

    private void MakeSuffixArray() {
        // One entry for every suffix that is at least q characters long.
        int size = m_Data.Sum(str => str.Length < m_Q ? 0 : str.Length - m_Q + 1);
        m_SA = new Position[size];
        int pos = 0;
        for (int i = 0; i < m_Data.Count; ++i)
            for (int j = 0; j <= m_Data[i].Length - m_Q; ++j)
                m_SA[pos++] = new Position(i, j);
        Array.Sort(
            m_SA,
            (x, y) => string.CompareOrdinal(
                m_Data[x.Key].Substring(x.Value),
                m_Data[y.Key].Substring(y.Value)
            )
        );
    }

    private void MakeIndex() {
        m_Dir = new Dictionary<string, int>(m_SA.Length);
        // Every q-gram is a prefix in the suffix table.
        for (int i = 0; i < m_SA.Length; ++i) {
            var pos = m_SA[i];
            m_Dir[m_Data[pos.Key].Substring(pos.Value, m_Q)] = i;
        }
    }
}
Usage is the same as in the other example, minus the required maxlen argument for the constructor.
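A minimal usage sketch (with a hypothetical strings array):
var index = new QGramIndex(strings, 3);   // q = 3, no maxlen needed
int pos = index["ell"];                   // index of the string containing "ell", or -1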
Upvotes: 0
Reputation: 545628
If your strings have a strict upper bound on their size, you might consider using a suffix array: simply pad all your strings to the same maximum length using a special character (e.g. the null character). Then concatenate all strings and build a suffix array index over the result.
This gives you a lookup runtime of O(m log n), where m is the length of your query string and n is the overall length of your combined strings. If this still isn't good enough, and your m has a fixed, small length, and your alphabet Σ is restricted in size (say, Σ < 128 different characters), you can additionally build a q-gram index. This will allow retrieval in constant time. However, the q-gram table requires Σ^m entries (= 8 MiB in the case of just 3 characters, and 1 GiB for 4 characters!).
It might be possible to reduce the size of the q-gram table (exponentially, in the best case) by adjusting the hash function. Instead of assigning a unique number to every possible q-gram you might employ a lossy hash function. The table then would have to store lists of possible suffix array indices instead of just one suffix array entry corresponding to an exact match. This would entail that lookup is no longer constant, though, because all entries in the list would have to be considered.
By the way, I'm not sure if you're familiar with how a q-gram index works since the Internet isn't helpful on this topic. I've mentioned this before in another topic. I've therefore included a description and an algorithm for the construction in my bachelor thesis.
I've written a very small C# proof of concept (since you stated elsewhere that you work with C#). It works, however it is very slow, for two reasons. First, the suffix array creation simply sorts the suffixes. This alone has runtime n^2 log n. There are far superior methods. Worse, however, is the fact that I use SubString to obtain the suffixes. Unfortunately, .NET creates a copy of the whole suffix for this. To use this code in practice, make sure that you use in-place methods which do not copy any data around unnecessarily. The same is true for retrieving the q-grams from the string.
It would possibly be even better not to construct the m_Data string used in my example at all. Instead, you could save a reference to the original array and simulate all my SubString accesses by working on that array.
Still, it's easy to see that this implementation essentially has expected constant-time retrieval (if the dictionary is well-behaved)! This is quite an achievement that can't possibly be beaten by a search tree/trie!
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

class QGramIndex {
    private readonly int m_Maxlen;
    private readonly string m_Data;
    private readonly int m_Q;
    private int[] m_SA;   // suffix array over the padded, concatenated data
    private Dictionary<string, int> m_Dir = new Dictionary<string, int>();

    // Compares two suffixes of Data, identified by their start positions.
    private struct StrCmp : IComparer<int> {
        public readonly String Data;
        public StrCmp(string data) { Data = data; }
        public int Compare(int x, int y) {
            return string.CompareOrdinal(Data.Substring(x), Data.Substring(y));
        }
    }

    private readonly StrCmp cmp;

    public QGramIndex(IList<string> strings, int maxlen, int q) {
        m_Maxlen = maxlen;
        m_Q = q;
        var sb = new StringBuilder(strings.Count * maxlen);
        foreach (string str in strings)
            sb.Append(str.PadRight(maxlen, '\u0000'));
        m_Data = sb.ToString();
        cmp = new StrCmp(m_Data);
        MakeSuffixArray();
        MakeIndex();
    }

    public int this[string s] { get { return FindInIndex(s); } }

    private void MakeSuffixArray() {
        // Approx. runtime: n^2 * log n!!!
        // But I claim the shortest ever implementation of a suffix array!
        m_SA = Enumerable.Range(0, m_Data.Length).ToArray();
        Array.Sort(m_SA, cmp);
    }

    private int FindInArray(int ith) {
        return Array.BinarySearch(m_SA, ith, cmp);
    }

    private int FindInIndex(string s) {
        int idx;
        if (!m_Dir.TryGetValue(s, out idx))
            return -1;
        return m_SA[idx] / m_Maxlen;   // map suffix position back to string index
    }

    private string QGram(int i) {
        return i > m_Data.Length - m_Q ?
            m_Data.Substring(i) :
            m_Data.Substring(i, m_Q);
    }

    private void MakeIndex() {
        for (int i = 0; i < m_Data.Length; ++i) {
            int pos = FindInArray(i);
            if (pos < 0) continue;
            m_Dir[QGram(i)] = pos;
        }
    }
}
static void Main(string[] args) {
    var strings = new[] { "hello", "world", "this", "is", "a",
                          "funny", "test", "which", "i", "have",
                          "taken", "much", "too", "far", "already" };
    var index = new QGramIndex(strings, 10, 3);
    var tests = new[] { "xyz", "aki", "ake", "muc", "uch", "too", "fun", "est",
                        "hic", "ell", "llo", "his" };
    foreach (var str in tests) {
        int pos = index[str];
        if (pos > -1)
            Console.WriteLine("\"{0}\" found in \"{1}\".", str, strings[pos]);
        else
            Console.WriteLine("\"{0}\" not found.", str);
    }
}
Upvotes: 13
Reputation: 9079
Choose a minimum search string size (e.g. four characters). Go through your list of string entries and build up a dictionary of every four-character substring, mapping each substring to a list of the entries it appears in. When you do a search, look up the first four characters of the search string to get an initial candidate set, then narrow that set down to only the entries that match the full search string.
The worst case of this is O(n), but you'll only hit it if your string entries are almost all identical. The lookup dictionary is likely to be quite large, so it's probably a good idea to store it on disk or use a relational database :-)
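A rough C# sketch of this idea (the class name and the fixed gram size are illustrative, not from the answer):
using System;
using System.Collections.Generic;
using System.Linq;

class SubstringIndex {
    const int N = 4;   // minimum search string size
    private readonly Dictionary<string, List<int>> m_Grams =
        new Dictionary<string, List<int>>();
    private readonly IList<string> m_Keys;

    public SubstringIndex(IList<string> keys) {
        m_Keys = keys;
        for (int i = 0; i < keys.Count; ++i)
            for (int j = 0; j + N <= keys[i].Length; ++j) {
                string gram = keys[i].Substring(j, N);
                List<int> hits;
                if (!m_Grams.TryGetValue(gram, out hits))
                    m_Grams[gram] = hits = new List<int>();
                if (hits.Count == 0 || hits[hits.Count - 1] != i)
                    hits.Add(i);   // skip duplicates from repeated grams
            }
    }

    public IEnumerable<string> Find(string query) {
        // Candidate set from the first four characters, then exact filter.
        List<int> candidates;
        if (query.Length < N || !m_Grams.TryGetValue(query.Substring(0, N), out candidates))
            return Enumerable.Empty<string>();
        return candidates.Select(i => m_Keys[i]).Where(k => k.Contains(query));
    }
}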
Upvotes: 0
Reputation:
Here's a WAG for you. I am in NO WAY Knuthian in my algorithm savvy.
Okay, so the naive trie encodes string keys by starting at the root of the tree and moving down branches that match each letter in the key, starting at the first letter of the key. So the key "foo" would be mapped to (root)->f->fo->foo, and the value would be stored in the location pointed to by the 'foo' node.
You are searching for ANY substring within the key, not just substrings that start at the beginning of the key.
So, what you need to do is associate a node with ANY key that contains that particular substring. In the foo example I gave before, you would NOT have found a reference to foo's value under the nodes 'f' and 'fo'. In a TST that supports the type of searches you're looking to do, you'd not only find the foo object under all three nodes ('f', 'fo', and 'foo'), you'd also find it under 'o' and 'oo' as well.
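A crude sketch of that indexing scheme, done here by inserting every suffix of every key into a trie so that any substring becomes a prefix walk (all names are made up for illustration):
using System.Collections.Generic;

class SuffixTrieNode {
    public readonly Dictionary<char, SuffixTrieNode> Children =
        new Dictionary<char, SuffixTrieNode>();
    public readonly HashSet<int> Keys = new HashSet<int>();   // entries passing through this node
}

class SuffixTrie {
    private readonly SuffixTrieNode m_Root = new SuffixTrieNode();

    public void Add(string key, int id) {
        for (int start = 0; start < key.Length; ++start) {    // every suffix of the key
            var node = m_Root;
            for (int i = start; i < key.Length; ++i) {
                SuffixTrieNode child;
                if (!node.Children.TryGetValue(key[i], out child))
                    node.Children[key[i]] = child = new SuffixTrieNode();
                child.Keys.Add(id);
                node = child;
            }
        }
    }

    public IEnumerable<int> Find(string substring) {
        var node = m_Root;
        foreach (char c in substring) {
            SuffixTrieNode child;
            if (!node.Children.TryGetValue(c, out child))
                return new int[0];
            node = child;
        }
        return node.Keys;
    }
}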
There are a couple of obvious consequences to expanding the search tree to support this type of indexing. First, you've just exploded the size of the tree. Staggeringly. If you can store it and use it in an efficient manner, your searches will take O(1) time. If your keys remain static, and you can find a way to partition the index so you don't take a huge IO penalty in using it, this might amortize to be worthwhile.
Second, you are going to find that searches for small strings will result in massive numbers of hits, which may make your search useless unless you, say, put a minimum length on search terms.
On the bright side, you might also find that you can compress the tree via tokenization (like zip compression does) or by compressing nodes that don't branch down (i.e., if you have 'w'->'o'->'o'-> and the first 'o' doesn't branch, you can safely collapse it to 'w'->'oo'). Maybe even a wicked-ass hash could make things easier...
Anyhow, WAG as I said.
Upvotes: 2
Reputation: 1552
Would it be possible to "hash" the key value? Basically, have a 2nd tree with all the possible values to search for, pointing to a list of keys into the 1st tree.
You're going to need 2 trees: the 1st one maps a hash value to the domain object; the 2nd tree maps the search strings to the hash values. The 2nd tree has multiple keys pointing to the same hash value.
Example:
tree 1: STCKVRFLW -> domain object
tree 2:
stack -> STCKVRFLW, STCK
over -> STCKVRFLW, VRBRD, VR
So searching in the 2nd tree gives you a list of keys to look up in the 1st tree.
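A loose C# sketch of the two-tree idea using dictionaries (all names and the hashing scheme are made up for illustration):
using System.Collections.Generic;

class TwoLevelIndex {
    // tree 1: hashed key -> domain object
    private readonly Dictionary<string, object> m_Objects =
        new Dictionary<string, object>();
    // tree 2: search string -> list of hashed keys
    private readonly Dictionary<string, List<string>> m_Search =
        new Dictionary<string, List<string>>();

    public void Add(string hashedKey, object domainObject, IEnumerable<string> searchTerms) {
        m_Objects[hashedKey] = domainObject;
        foreach (string term in searchTerms) {
            List<string> keys;
            if (!m_Search.TryGetValue(term, out keys))
                m_Search[term] = keys = new List<string>();
            keys.Add(hashedKey);
        }
    }

    public IEnumerable<object> Find(string term) {
        List<string> keys;
        if (!m_Search.TryGetValue(term, out keys))
            yield break;
        foreach (string key in keys)
            yield return m_Objects[key];
    }
}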
Upvotes: 0
Reputation: 3615
Would you get any advantage from making your trie keys comparable to the size of the machine register? So if you are on a 32-bit box you could compare 4 characters at once instead of each character individually. I don't know how badly that would increase the size of your app.
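Purely as an illustration of the packing idea (the helper name is made up; this assumes ASCII-range characters):
static uint Pack4(string s, int offset) {
    // Big-endian packing so packed words compare in the same order as the characters.
    uint w = 0;
    for (int i = 0; i < 4; ++i)
        w = (w << 8) | (byte)s[offset + i];
    return w;
}
// Pack4("stackoverflow", 0) == Pack4("stack", 0): one integer compare covers "stac".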
Upvotes: 0