Guy Bowden
Guy Bowden

Reputation: 5107

matching stored keywords/phrases in text

I have a database table with around 1000 keywords/phrases (one to four words long) - This table changes rarely, so I could extract the data into something more useful (like a regular expression?) - So this is not finding / guessing at keywords based on natural language processing..

I then have a user inputting some text into a form that I'd like to match against my keywords and phrases.

The program would then store a link to each phrase matched next to the text.

So if we ran the algorithm on this question text against a few phrases that are in here, we'd get a result like so:

{"inputting some text" : 1,
 "extract the data" : 1,
 "a phrase not here" : 0}

What are my options?

  1. Compile a regular expression
  2. Some sort of SQL query
  3. A third way?

Bearing in mind that there's a 1000 possible phrases..

I'm running Django / Python with MySQL.

edit: I'm currently doing this:

>>> text_input = "This is something with first phrase in and third phrase" 
>>> regex = "first phrase|second phrase|third phrase" 
>>> p = re.compile(regex, re.I) 
>>> p.findall(text_input)
['first phrase','third phrase']

Upvotes: 2

Views: 6726

Answers (6)

Robert Rossney
Robert Rossney

Reputation: 96860

If you have 1000 phrases, and you're searching an input string to find which of those phrases are substrings, you're probably not going to be happy with the performance you get from using a big regular expression. A trie is a bit more work to implement, but it's a lot more efficient: the regular expression a|b|c|d|e does five tests on each character in a given input string, while a trie only does one. You could conceivably also use a lexer, like Plex, that produces a DFA.

Edit:

I appear to be procrastinating this morning. Try this:

    class Trie(object):
        def __init__(self):
            self.children = {}
            self.item = None
        def add(self, item, remainder=None):
            """Add an item to the trie."""
            if remainder == None:
                remainder = item
            if remainder == "":
                self.item = item
            else:
                ch = remainder[0]
                if not self.children.has_key(ch):
                    self.children[ch] = Trie()
                self.children[ch].add(item, remainder[1:])
        def find(self, word):
            """Return True if word is an item in the trie."""
            if not word:
                return True
            ch = word[0]
            if not self.children.has_key(ch):
                return False
            return self.children[ch].find(word[1:])
        def find_words(self, word, results=None):
            """Find all items in the trie that word begins with."""
            if results == None:
                results = []
            if self.item:
                results.append(self.item)
            if not word:
                return results
            ch = word[0]
            if not self.children.has_key(ch):
                return results
            return self.children[ch].find_words(word[1:], results)

A quick test (words.txt is the BSD words file, a very handy thing to have around - it contains about 240,000 words):

>>> t = Trie()
>>> with open(r'c:\temp\words.txt', 'r') as f:
        for word in f:
            t.add(word.strip())

That takes about 15 seconds on my machine. This, however, is almost instantaneous:

>>> s = "I played video games in a drunken haze."
>>> r = []
>>> for i in range(len(s)):
        r.extend(t.find_words(s[i:]))
>>> r
['I', 'p', 'play', 'l', 'la', 'lay', 'a', 'ay', 'aye', 'y', 'ye', 'yed', 'e', 'd', 'v', 'video', 'i', 'id', 'ide', 'd', 'de', 'e', 'o', 'g', 'ga', 'gam', 'game', 'a', 'am', 'ame', 'm', 'me', 'e', 'es', 's', 'i', 'in', 'n', 'a', 'd', 'drunk', 'drunken', 'r', 'run', 'u', 'un', 'unken', 'n', 'k', 'ken', 'e', 'en', 'n', 'h', 'ha', 'haze', 'a', 'z', 'e']

Yes, unken is in words.txt. I have no idea why.

Oh, and I did try to compare with regular expressions:

 >>> import re
 >>> with open(r'c:\temp\words.txt', 'r') as f:
         p = "|".join([l.strip() for l in f])

 >>> p = re.compile(p)

 Traceback (most recent call last):
  File "<pyshell#250>", line 1, in <module>
    p = re.compile(p)
  File "C:\Python26\lib\re.py", line 188, in compile
    return _compile(pattern, flags)
  File "C:\Python26\lib\re.py", line 241, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Python26\lib\sre_compile.py", line 529, in compile
    groupindex, indexgroup
OverflowError: regular expression code size limit exceeded

Upvotes: 0

John Machin
John Machin

Reputation: 83002

The algorithm for this job is Aho-Corasick ... see the link at the bottom whch points to a C-extension for Python.

Upvotes: 2

Alex Martelli
Alex Martelli

Reputation: 882481

Once you've formed your pattern such as (first phrase)|(the second)|(and another) (with the parentheses I indicate) and compiled it into a regular expression object r, a good way to loop on matches and identify which match it was is:

class GroupCounter(object):
  def __init__(self, phrases):
    self.phrases = phrases
    self.counts = [0] * len(phrases)
  def __call__(self, mo):
    self.counts[mo.lastindex - 1] += 1
    return ''
  def asdict(self):
    return dict(zip(self.phrases, self.counts))

g = GroupCounter(['first phrase', 'the second', 'and another'])
r.sub(g, thetext)
print g.asdict()

It would also be reasonable to have the GroupCounter instance also build the regex object, since it does need the list of phrases in the same order as it appears in the regex itself.

Upvotes: 0

James Brooks
James Brooks

Reputation: 4375

Here is a slight variation on SilentGhost's answer. You load in the keywords line by line. store them in a set. for each keyword that you find in the user input increase the corresponding entry in the results.

keyword_file = StringIO("""inputting some text
    extract the data
    a phrase not here""")

keywords = set(line.strip() for line in keyword_file)

results = defaultdict(int)
for phrase in keywords:
    if userinput.find(phrase) != -1:
        results[phrase] += 1

print results

Hope this points you in the right direction. Not entirely sure this is what you were asking but it's my best guess.

Do you care about speed? Why don't you like the method you use now? Does your method work?

Upvotes: 0

Jiaaro
Jiaaro

Reputation: 76948

Just a heads up... you may be interested in django's support for regex in queries

Example from the linked django docs:

Entry.objects.get(title__regex=r'^(An?|The) +')

Upvotes: 0

SilentGhost
SilentGhost

Reputation: 319889

If I understand you correctly, you have a unique set of strings, that you want to compare an input strings against. In this case you could use set to store both processing results and db values. Comparison then could be done as follows:

>>> db = {'abc', 'def', 'jhi', 'asdf'}
>>> inpt = {'abc', 'tmp'}
>>> db & inpt
{'abc'}

The further conversion to the dictionary is trivial.

Upvotes: 0

Related Questions