Reputation: 24049
I have a huge list of large spaCy documents and a list of words that I want to look up in them. For example, I want to look up the word "Aspirin" in a website text that was parsed with spaCy. The list of keywords I want to look up is quite long.
Don't use spaCy and just use `if keyword in website_text:` as a simple matcher. Of course this has the downside that tokens are ignored, so a search for `test` will yield false positives on words like `tested`, `attested`, etc.
spaCy's `Matcher` objects are an option, but I would need to automatically build a lot of matchers based on my list of keywords.
Is there a recommended way to achieve this task?
Upvotes: 1
Views: 1103
Reputation: 769
I'd go with your naive approach, but you can use regular expressions to get a smarter match that won't pick up false positives.
For example, `\b(test|aspirin)\b` matches the words "test" and "aspirin", but not "aspiring", "attested", or "testing". You can add more keywords inside the parentheses, separated by pipes.
Here's an example of it working.
To actually apply that to Python code, you can use the re module.
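As a rough sketch of how that could look, the snippet below builds one compiled pattern from a keyword list and scans a text with it. The helper names `build_keyword_pattern` and `find_keywords` and the sample text are made up for illustration. I've also assumed you want case-insensitive matching (so "Aspirin" and "aspirin" both count), which the bare pattern above doesn't do by default.

```python
import re


def build_keyword_pattern(keywords, ignore_case=True):
    """Compile one regex that matches any of the keywords as a whole word."""
    # re.escape makes characters such as "." or "+" match literally.
    # Longer keywords are listed first so the alternation tries them first.
    escaped = [re.escape(k) for k in sorted(keywords, key=len, reverse=True)]
    flags = re.IGNORECASE if ignore_case else 0
    # Note: \b only behaves as expected when a keyword starts and ends
    # with a word character (letter, digit, or underscore).
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", flags)


def find_keywords(text, pattern):
    """Return the set of distinct keywords found in text, lowercased."""
    return {match.group(0).lower() for match in pattern.finditer(text)}


keywords = ["test", "aspirin"]
pattern = build_keyword_pattern(keywords)

website_text = "Aspirin was tested in an aspiring lab; the test is attested."
print(sorted(find_keywords(website_text, pattern)))  # ['aspirin', 'test']
```

Compiling the pattern once and reusing it for every document avoids rebuilding it per lookup. If you already have a spaCy `Doc`, you should be able to pass its `doc.text` to `find_keywords`.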
Upvotes: 1