Reputation: 33
I have a SortedDictionary<string, string>, ordered by key length descending, of the form:
I need to find the value of the longest key in the dictionary that matches a substring in a given phrase. Given that the dictionary has about one thousand entries, is there a better approach than a brute force search?
The examples above should return:
"red fox" - address1
"foxes" - address3
I have the brute force solution of iterating over the Keys, but in view of the number of keys and the need to perform the search in tens of phrases, I'm looking for a cleverer approach.
-1
I did some performance measuring.
With a dictionary of 100 keys, searching for matches in 100 phrases took on the average 65 ms.
If the execution time scales linearly with the dictionary length and with the number of phrases, then the whole process will take about 20 seconds (search time only).
Taking into account the time needed to read the phrases from files and to make use of the matches, it will take probably 3-4 minutes for the whole run.
The user will have to look at some animation for this time, or she could drink a glass of something.
Would this be acceptable?
If yes, the brute force approach is OK.
If not, I am still open to suggestions.
Upvotes: 0
Views: 129
Reputation: 33
I solved it with a different approach:
Given:
For each phrase in the files, find the longest 1- to 3-word sub-phrase matching one of the dictionary keys.
Solution:
For each phrase p in F, create a List containing all 1-, 2- and 3-word sub-phrases, ordered by length descending; there will be 3W-3 sub-phrases.
For each sub-phrase in the List, check if it matches a key in the Dictionary.
(In contrast, the brute force approach, for each phrase p in each file, iterates over the dictionary keys, using p.Contains(key) to find a match.)
Performance:
The time spent is linearly proportional to F, P and W and constant for the dictionary lookup.
This is an improvement over the brute force approach, which is proportional to the dictionary size as well. Thus, the performance gain goes up with the dictionary size. Additionally, the brute force approach uses string.Contains() which is time consuming as well.
To sum up, the brute force solution is O(N) while the sub-phrase solution is O(1).
I think there is no need to post any code, which is just a simple double loop which builds the sub-phrases list.
Upvotes: 0
Reputation: 31
Judging from your requirement the brute-force approach is really not a best way to do it. A much more efficient approach would be to use a "Trie" (prefix tree) type. Using "Trie" can help you efficiently search for the longest matching key in a given phrase. Just create a "Tire" structure, than insert each key from your "SortedDictionary" into the "Trie". Each node in the "Trie" structure will represent a character of a key. At the end of each key, store the corresponding value (address). For the search for longest match you can implement like: For each phrase, iterate through each substring and search for it in the "Trie" structure. Than just track the longest match found for each phrase. I believe it what you need. Take a look here: A Trie implementation in C#, Using Trie Class for Efficient Text Pattern Searching in C#
I just put some code below. I think in general it does what you need:
public sealed class Trie
{
private readonly TrieNode _root = new();
public void Insert(string key, string address)
{
var node = _root;
foreach (var ch in key)
{
if (!node.Children.ContainsKey(ch))
{
node.Children[ch] = new TrieNode();
}
node = node.Children[ch];
}
node.Address = address;
}
public string SearchLongestMatch(string phrase, out string address)
{
address = null;
var maxLength = 0;
TrieNode longestMatchNode = null;
for (var i = 0; i < phrase.Length; i++)
{
var node = _root;
var length = 0;
for (var j = i; j < phrase.Length; j++)
{
var ch = phrase[j];
if (node.Children.TryGetValue(ch, out var child))
{
node = child;
length++;
if (length > maxLength)
{
maxLength = length;
longestMatchNode = node;
}
}
else
{
break;
}
}
}
if (longestMatchNode != null)
{
address = longestMatchNode.Address;
var index = phrase.IndexOf(longestMatchNode.Address, StringComparison.Ordinal);
if (index != -1)
{
return phrase.Substring(index, maxLength);
}
}
return null;
}
}
public class TrieNode
{
public Dictionary<char, TrieNode> Children { get; set; } = new();
public string Address { get; set; }
}
public sealed class DescendingComparer<T> : IComparer<T> where T : IComparable<T>
{
public int Compare(T? x, T? y)
{
return y.CompareTo(x);
}
}
public class Program
{
public static void Main()
{
var dict = new SortedDictionary<string, string>(new DescendingComparer<string>())
{ { "red fox", "address1" }, { "weasel", "address2" }, { "foxes", "address3" }, { "fox", "address4" } };
var trie = new Trie();
foreach (var kvp in dict)
{
trie.Insert(kvp.Key, kvp.Value);
}
var phrases = new List<string> { "today the red fox is home", "no foxes today" };
foreach (var phrase in phrases)
{
var match = trie.SearchLongestMatch(phrase, out var address);
Console.WriteLine($"Phrase: \"{phrase}\" - Match: \"{match}\" - Address: \"{address}\"");
}
}
}
Upvotes: 0