Reputation: 116840
I have a string like this:
inputString = "this is the first sentence in this book the first sentence is really the most interesting the first sentence is always first"
and a dictionary like this:
{
'always first': 0,
'book the': 0,
'first': 0,
'first sentence': 0,
'in this': 0,
'interesting the': 0,
'is always': 0,
'is really': 0,
'is the': 0,
'most interesting': 0,
'really the': 0,
'sentence in': 0,
'sentence is': 0,
'the first': 0,
'the first sentence': 0,
'the first sentence is': 0,
'the most': 0,
'this': 0,
'this book': 0,
'this is': 0
}
What is the most efficient way to update the frequency counts in this dictionary in one pass over the input string (if that is even possible)? I have a feeling there must be a parser technique for this, but I am not an expert in this area, so I am stuck. Any suggestions?
Upvotes: 0
Views: 362
Reputation: 16959
Just go through the string and use the dictionary as you normally would to increment each occurrence. This is O(n), since a dictionary lookup is O(1) on average. I do this regularly, even for large word collections.
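For instance, a minimal sketch of that idea, assuming the question's dictionary is named counts:
words = inputString.split()
maxWords = max(len(key.split()) for key in counts)  # longest phrase, in words
for i in range(len(words)):
    # try every phrase length starting at word i; each lookup is O(1) on average
    for n in range(1, maxWords + 1):
        phrase = " ".join(words[i:i + n])
        if phrase in counts:
            counts[phrase] += 1
This makes a single pass over the words, with a bounded number of lookups at each position.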
Upvotes: 0
Reputation: 47183
When confronted with this problem, I think, "I know, I'll use regular expressions".
Start off by making a list of all the patterns, sorted by decreasing length:
patterns = sorted(counts.keys(), key=len, reverse=True)
Now make that into a single massive regular expression which is an alternation of all the patterns:
import re

allPatterns = re.compile("|".join(patterns))
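The phrases here are plain words, but if a pattern could ever contain regex metacharacters, escape each one first:
allPatterns = re.compile("|".join(re.escape(p) for p in patterns))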
Now run that pattern over the input string, and count up the number of hits on each pattern as you go:
pos = 0
while True:
    match = allPatterns.search(inputString, pos)
    if match is None:
        break
    # advance by only one character, so overlapping matches are found too
    pos = match.start() + 1
    counts[match.group()] += 1
You will end up with the counts of each of the strings.
(An aside: I believe most good regular expression libraries will compile a large alternation over fixed strings like this using the Aho-Corasick algorithm that e.dan mentioned. Using a regular expression library is probably the easiest way of applying this algorithm.)
With one problem: where a pattern is a prefix of another pattern (e.g. 'first' and 'first sentence'), only the longer pattern will have a count against it. This is by design: that's what the sort by length at the start was for.
We can deal with this as a postprocessing step; go through the counts, and whenever one pattern is a prefix of another, add the longer pattern's counts to the shorter pattern's. Be careful not to double-add. That's simply done as a nested loop:
correctedCounts = {}
for donor in counts:
    for recipient in counts:
        # a pattern is a prefix of itself, so its own count is included exactly once
        if donor.startswith(recipient):
            correctedCounts[recipient] = correctedCounts.get(recipient, 0) + counts[donor]
That dictionary now contains the actual counts.
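Following up the aside above: if you do want to run Aho-Corasick directly, the third-party pyahocorasick package is one option. A sketch, assuming the question's dictionary is named counts:
import ahocorasick  # pip install pyahocorasick

automaton = ahocorasick.Automaton()
for phrase in counts:
    automaton.add_word(phrase, phrase)
automaton.make_automaton()

# iter() reports every match, overlapping and nested ones included,
# so no prefix postprocessing step is needed
for endIndex, phrase in automaton.iter(inputString):
    counts[phrase] += 1
Note that this matches substrings rather than whole words, so a phrase like 'first' would also be counted inside a longer word; that doesn't arise with this input.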
Upvotes: 2
Reputation: 67860
Aho–Corasick definitely seems the way to go, but if I needed a simple, pure-Python implementation, I'd write:
import collections

def consecutive_groups(seq, n):
    # yield the window of up to n items starting at each position; the shorter
    # windows near the end let the shorter snippets still match there
    return (seq[i:i + n] for i in range(len(seq)))

def get_snippet_occurrences(snippets):
    split_snippets = [s.split() for s in snippets]
    max_snippet_length = max(len(sp) for sp in split_snippets)
    for group in consecutive_groups(inputString.split(), max_snippet_length):
        for lst in split_snippets:
            if group[:len(lst)] == lst:
                yield " ".join(lst)

# snippets: the phrases to count, i.e. the keys of the question's dictionary
print(collections.Counter(get_snippet_occurrences(snippets)))
# Counter({'first': 4, 'the first sentence': 3, 'first sentence': 3, 'the first': 3, 'the first sentence is': 2, 'this': 2, 'sentence is': 2, 'this is': 1, 'is the': 1, 'sentence in': 1, 'in this': 1, 'this book': 1, 'book the': 1, 'is really': 1, 'really the': 1, 'the most': 1, 'most interesting': 1, 'interesting the': 1, 'is always': 1, 'always first': 1})
Upvotes: 2