guerda

Reputation: 24049

How to check if a token is present in a document with spaCy?

I have a huge list of larger spaCy documents and a list of words which I want to look up in the document. An example: I want to look up the word "Aspirin" in a website text, which was parsed with spaCy. The list of keywords I want to look up is quite long.

Naive approach

Don't use spaCy at all and just use if keyword in website_text: as a simple matcher. The downside is that token boundaries are ignored, so a search for test yields false positives in words like tested, attested, etc.
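To make the false positive concrete, here is a minimal sketch. The sample text and variable names are made up for illustration, and the whitespace split is only a crude stand-in for real tokenization:

```python
# Hypothetical sample text and keyword, chosen to trigger the problem.
website_text = "The drug was tested and the result attested."
keyword = "test"

# Substring check: succeeds because "test" occurs inside "tested" and "attested".
substring_hit = keyword in website_text

# Whole-word check: split on whitespace, strip punctuation, compare whole words.
words = {w.strip(".,;:!?").lower() for w in website_text.split()}
word_hit = keyword in words

print(substring_hit, word_hit)
```

The substring check reports a hit even though the word test never appears on its own, while the whole-word check does not.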

Use spaCy's matchers

Matchers are an option, but I would need to automatically build a lot of matchers based on my list of keywords.

Is there a recommended way to achieve this task?

Upvotes: 1

Views: 1103

Answers (1)

Peritract

Reputation: 769

I'd go with your naive approach, but you can use regular expressions to get a smarter match that won't pick up false positives.

For example, \b(test|aspirin)\b picks up on the words "test" and "aspirin", but not on "aspiring", "attested", or "testing", because \b only matches at a word boundary. You could add other words inside the brackets, separated by pipes, to pick up more keywords.

Here's an example of it working.

To actually apply that to Python code, you can use the re module.
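As a rough sketch of how that could look, the snippet below builds the pattern from a keyword list. The keyword list, sample text, and the find_keywords helper are hypothetical names for illustration. Using re.IGNORECASE is my assumption, added so that "Aspirin" matches "aspirin", and re.escape is there in case a keyword contains regex metacharacters:

```python
import re

# Hypothetical keyword list; in practice this would be the long list from the question.
keywords = ["test", "aspirin"]

# Escape each keyword so it is matched literally, join with pipes,
# and wrap the alternation in word boundaries.
pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
    re.IGNORECASE,  # assumption: matching should ignore case
)

def find_keywords(text):
    """Return the set of keywords (lowercased) that occur as whole words in text."""
    return {m.group(1).lower() for m in pattern.finditer(text)}

website_text = "Aspirin was tested on aspiring athletes; one test was attested."
found = find_keywords(website_text)
print(found)
```

Compiling the pattern once and reusing it across all documents avoids rebuilding the alternation for every text. For spaCy documents, the raw string is available as doc.text.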

Upvotes: 1
