Ian

Reputation: 193

Using pygtrie, how do you find all the words in some text that have been added to a trie?

If I have...

from pygtrie import Trie

trie = Trie()
trie["Grand Canyon"] = True
trie["New York"] = True

How do I search the trie such that it returns all the key names found in some text?

I would expect there to be something like...

matches = trie.find("Grand Canyon, New York, I like strawberries.")

... but I can't find it in the docs.

Upvotes: 1

Views: 132

Answers (1)

user24714692

Reputation: 4949

  • Here, we are dealing with phrases (e.g., New York City), not single words (e.g., Apple, Orange).
  • Therefore, preprocessing or postprocessing the text is more complicated: splitting on whitespace alone would break the phrases apart.

from pygtrie import Trie
import re


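def _matches(trie, s):
    # Candidates: runs of capitalized tokens (e.g. "Alice M. Bob")
    # or single lowercase words.
    p = r'(?:(?:[A-Z][A-Z.-]*[a-z-]*\s*){1,}|[a-z-]+)'
    res = []
    for ph in re.findall(p, s):
        # Drop the trailing whitespace captured by \s* before the lookup.
        phrase = ph.strip()
        if trie.has_key(phrase):
            res.append(phrase)

    return res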


trie = Trie()
trie["Grand Canyon"] = True
trie["New York"] = True
trie["Alice M. Bob"] = True
trie["strawberries"] = True

s = """

Grand Canyon, New York, I like strawberries. Alice Bob has two coins x-y ... 
Grand Canyon, New York City, I like strawberries. Alice Bob has two coins x-y ... 
Alice M. Bob some words Firstname Middle Lastname

"""


print(_matches(trie, s))


Prints

['Grand Canyon', 'New York', 'strawberries', 'Grand Canyon', 'strawberries', 'Alice M. Bob']

Note:

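  • Here, we use a pattern to pull out candidate phrases, and each candidate is then looked up in the trie. The pattern can be modified as needed: (?:(?:[A-Z][A-Z.-]*[a-z-]*\s*){1,}|[a-z-]+).
  • A candidate must equal a key exactly. That is why "New York City" on the second line does not produce "New York" in the output.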

Upvotes: -1
