Reputation: 116850
I am using this python library that implements the Aho-Corasick string searching algorithm that finds a set of patterns in a given string in one pass. The output is not what I am expecting:
In [4]: import ahocorasick
In [5]: import collections
In [6]: tree = ahocorasick.KeywordTree()
In [7]: ss = "this is the first sentence in this book the first sentence is really the most interesting the first sentence is always first"
In [8]: words = ["first sentence is", "first sentence", "the first sentence", "the first sentence is"]
In [9]: for w in words:
...: tree.add(w)
...:
In [10]: tree.make()
In [13]: final = collections.defaultdict(int)
In [15]: for match in tree.findall(ss, allow_overlaps=True):
....: final[ss[match[0]:match[1]]] += 1
....:
In [16]: final
{ 'the first sentence': 3, 'the first sentence is': 2}
The output I was expecting was this:
{
'the first sentence': 3,
'the first sentence is': 2,
'first sentence': 3,
'first sentence is': 2
}
Am I missing something? I am doing this on large strings so post processing is not my first option. Is there a way to get the desired output?
Upvotes: 1
Views: 226
Reputation: 462
I don't know about the ahocorasick
module, but those results seem suspect. The acora module shows this:
import acora
import collections
ss = "this is the first sentence in this book "
"the first sentence is really the most interesting "
"the first sentence is always first"
words = ["first sentence is",
"first sentence",
"the first sentence",
"the first sentence is"]
tree = acora.AcoraBuilder(*words).build()
for match in tree.findall(ss):
result[match] += 1
Results:
>>> result
defaultdict(<type 'int'>,
{'the first sentence' : 3,
'first sentence' : 3,
'first sentence is' : 2,
'the first sentence is': 2})
Upvotes: 1
Reputation: 134005
The way I understand the Aho-Corasick algorithm and the way I've implemented it would have me agree with your expected output. It looks like the Python library you're using is in error, or perhaps there's a flag that you can tell it to give you all matches starting at a position rather than just the longest match starting at a particular position.
The examples in the original paper, http://www.win.tue.nl/~watson/2R080/opdracht/p333-aho-corasick.pdf, support your understanding.
Upvotes: 1