Reputation: 5107
I have a database table with around 1000 keywords/phrases (one to four words long) - This table changes rarely, so I could extract the data into something more useful (like a regular expression?) - So this is not finding / guessing at keywords based on natural language processing..
I then have a user inputting some text into a form that I'd like to match against my keywords and phrases.
The program would then store a link to each phrase matched next to the text.
So if we ran the algorithm on this question text against a few phrases that are in here, we'd get a result like so:
{"inputting some text" : 1,
"extract the data" : 1,
"a phrase not here" : 0}
What are my options?
Bearing in mind that there's a 1000 possible phrases..
I'm running Django / Python with MySQL.
edit: I'm currently doing this:
>>> text_input = "This is something with first phrase in and third phrase"
>>> regex = "first phrase|second phrase|third phrase"
>>> p = re.compile(regex, re.I)
>>> p.findall(text_input)
['first phrase','third phrase']
Upvotes: 2
Views: 6726
Reputation: 96860
If you have 1000 phrases, and you're searching an input string to find which of those phrases are substrings, you're probably not going to be happy with the performance you get from using a big regular expression. A trie is a bit more work to implement, but it's a lot more efficient: the regular expression a|b|c|d|e
does five tests on each character in a given input string, while a trie only does one. You could conceivably also use a lexer, like Plex, that produces a DFA.
Edit:
I appear to be procrastinating this morning. Try this:
class Trie(object):
def __init__(self):
self.children = {}
self.item = None
def add(self, item, remainder=None):
"""Add an item to the trie."""
if remainder == None:
remainder = item
if remainder == "":
self.item = item
else:
ch = remainder[0]
if not self.children.has_key(ch):
self.children[ch] = Trie()
self.children[ch].add(item, remainder[1:])
def find(self, word):
"""Return True if word is an item in the trie."""
if not word:
return True
ch = word[0]
if not self.children.has_key(ch):
return False
return self.children[ch].find(word[1:])
def find_words(self, word, results=None):
"""Find all items in the trie that word begins with."""
if results == None:
results = []
if self.item:
results.append(self.item)
if not word:
return results
ch = word[0]
if not self.children.has_key(ch):
return results
return self.children[ch].find_words(word[1:], results)
A quick test (words.txt
is the BSD words file, a very handy thing to have around - it contains about 240,000 words):
>>> t = Trie()
>>> with open(r'c:\temp\words.txt', 'r') as f:
for word in f:
t.add(word.strip())
That takes about 15 seconds on my machine. This, however, is almost instantaneous:
>>> s = "I played video games in a drunken haze."
>>> r = []
>>> for i in range(len(s)):
r.extend(t.find_words(s[i:]))
>>> r
['I', 'p', 'play', 'l', 'la', 'lay', 'a', 'ay', 'aye', 'y', 'ye', 'yed', 'e', 'd', 'v', 'video', 'i', 'id', 'ide', 'd', 'de', 'e', 'o', 'g', 'ga', 'gam', 'game', 'a', 'am', 'ame', 'm', 'me', 'e', 'es', 's', 'i', 'in', 'n', 'a', 'd', 'drunk', 'drunken', 'r', 'run', 'u', 'un', 'unken', 'n', 'k', 'ken', 'e', 'en', 'n', 'h', 'ha', 'haze', 'a', 'z', 'e']
Yes, unken
is in words.txt. I have no idea why.
Oh, and I did try to compare with regular expressions:
>>> import re
>>> with open(r'c:\temp\words.txt', 'r') as f:
p = "|".join([l.strip() for l in f])
>>> p = re.compile(p)
Traceback (most recent call last):
File "<pyshell#250>", line 1, in <module>
p = re.compile(p)
File "C:\Python26\lib\re.py", line 188, in compile
return _compile(pattern, flags)
File "C:\Python26\lib\re.py", line 241, in _compile
p = sre_compile.compile(pattern, flags)
File "C:\Python26\lib\sre_compile.py", line 529, in compile
groupindex, indexgroup
OverflowError: regular expression code size limit exceeded
Upvotes: 0
Reputation: 83002
The algorithm for this job is Aho-Corasick ... see the link at the bottom whch points to a C-extension for Python.
Upvotes: 2
Reputation: 882481
Once you've formed your pattern such as (first phrase)|(the second)|(and another)
(with the parentheses I indicate) and compiled it into a regular expression object r
, a good way to loop on matches and identify which match it was is:
class GroupCounter(object):
def __init__(self, phrases):
self.phrases = phrases
self.counts = [0] * len(phrases)
def __call__(self, mo):
self.counts[mo.lastindex - 1] += 1
return ''
def asdict(self):
return dict(zip(self.phrases, self.counts))
g = GroupCounter(['first phrase', 'the second', 'and another'])
r.sub(g, thetext)
print g.asdict()
It would also be reasonable to have the GroupCounter instance also build the regex object, since it does need the list of phrases in the same order as it appears in the regex itself.
Upvotes: 0
Reputation: 4375
Here is a slight variation on SilentGhost's answer. You load in the keywords line by line. store them in a set. for each keyword that you find in the user input increase the corresponding entry in the results.
keyword_file = StringIO("""inputting some text
extract the data
a phrase not here""")
keywords = set(line.strip() for line in keyword_file)
results = defaultdict(int)
for phrase in keywords:
if userinput.find(phrase) != -1:
results[phrase] += 1
print results
Hope this points you in the right direction. Not entirely sure this is what you were asking but it's my best guess.
Do you care about speed? Why don't you like the method you use now? Does your method work?
Upvotes: 0
Reputation: 76948
Just a heads up... you may be interested in django's support for regex in queries
Example from the linked django docs:
Entry.objects.get(title__regex=r'^(An?|The) +')
Upvotes: 0
Reputation: 319889
If I understand you correctly, you have a unique set of strings, that you want to compare an input strings against. In this case you could use set
to store both processing results and db values. Comparison then could be done as follows:
>>> db = {'abc', 'def', 'jhi', 'asdf'}
>>> inpt = {'abc', 'tmp'}
>>> db & inpt
{'abc'}
The further conversion to the dictionary is trivial.
Upvotes: 0