Reputation: 24506
I have an array of words I need to do a find-and-replace by regex operation on, and sometimes this array can be thousands of words long. I've tested and found that stemming the words using common prefixes is much faster than searching for them individually. That is, ^where|why$
is slower than ^wh(ere|y)$
. Obviously it's not a noticeable difference in such a short example, but it's considerably faster where there are thousands of alternatives and the subject string is long.
So I'm looking for a way to do this stemming automatically, for instance to convert a string[] { "what", "why", "where", "when", "which" }
into wh(at|y|e(re|n)|i(ch))
Is there already a recognized algorithm out there that does this ? If not, how would you go about it ? It seems to need to be done recursively but I can't quite get my head round how to do it. I have a method I wrote that works to a limited extent, but it's inelegant, 60 lines longs and uses multiple nested foreach loops so it's a future maintenance nightmare. I'm sure there's a much better way, if anyone could point me in the right direction that'd be much appreciated...
Upvotes: 5
Views: 945
Reputation: 57220
This code should work:
public static class StemmingUtilities
{
private class Node
{
public char? Value { get; private set; }
public Node Parent { get; private set; }
public List<Node> Children { get; private set; }
public Node(char? c, Node parent)
{
this.Value = c;
this.Parent = parent;
this.Children = new List<Node>();
}
}
public static string GetRegex(IEnumerable<string> tokens)
{
var root = new Node(null,null);
foreach (var token in tokens)
{
var current = root;
for (int i = 0; i < token.Length; i++)
{
char c = token[i];
var node = current.Children.FirstOrDefault(x => x.Value.Value == c);
if (node == null)
{
node = new Node(c,current);
current.Children.Add(node);
}
current = node;
}
}
return BuildRexp(root);
}
private static string BuildRexp(Node root)
{
string s = "";
bool addBracket = root.Children.Count > 1;
// uncomment the following line to avoid first brakets wrapping (occurring in case of multiple root's children)
// addBracket = addBracket && (root.Parent != null);
if (addBracket)
s += "(";
for(int i = 0; i < root.Children.Count; i++)
{
var child = root.Children[i];
s += child.Value;
s += BuildRexp(child);
if (i < root.Children.Count - 1)
s += "|";
}
if (addBracket)
s += ")";
return s;
}
}
Usage:
var toStem1 = new[] { "what", "why", "where", "when", "which" };
string reg1 = StemmingUtilities.GetRegex(toStem1);
// reg1 = "wh(at|y|e(re|n)|ich)"
string[] toStem2 = new[] { "why", "abc", "what", "where", "apple", "when" };
string reg2 = StemmingUtilities.GetRegex(toStem2);
// reg2 = "(wh(y|at|e(re|n))|a(bc|pple))"
EDIT:
to get reg2 = "wh(y|at|e(re|n))|a(bc|pple)"
i.e. without the first wrapping brackets, just uncomment the marked line in BuildRexp
method.
Upvotes: 3