python spacy looking for two (or more) words in a window

Question

I am trying to identify concepts in texts. I often consider that a concept appears in a text when two or more words appear relatively close to each other. For instance, a concept would be any of the words forest, tree, nature at a distance of less than 4 words from fire, burn, overheat.

I am learning spacy and so far I can use the matcher like this:

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],[{"LOWER": "hello"}, {"LOWER": "world"}])

That would match hello world and hello, world (or tree firing for the above mentioned example)

I am looking for a solution that would yield matches of the words Hello and World within a window of 5 words.

I had a look into: https://spacy.io/usage/rule-based-matching

and the operators described there, but I am not able to express this word-window approach in spaCy syntax.

Furthermore, I am not able to generalize it to more than two words.

Any ideas? Thanks

David Dale · Accepted Answer

For a window of K words, where K is relatively small, you can add K-2 optional wildcard tokens between your two words. A wildcard matches any token, and in spaCy terms it is just an empty dict. Optional means the token may or may not be there, and in spaCy it is encoded as {"OP": "?"}.

Thus, you can write your matcher as

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
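# spaCy v2 call signature (key, on_match callback, *patterns).
# In spaCy v3 the patterns are passed as a list instead: matcher.add("HelloWorld", [pattern])
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"OP": "?"}, {"OP": "?"}, {"OP": "?"}, {"LOWER": "world"}])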

which means you look for "hello", then 0 to 3 tokens of any kind, then "world". For example, for

doc = nlp(u"Hello brave new world")
for match_id, start, end in matcher(doc):
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)

it will print

15578876784678163569 HelloWorld 0 4 Hello brave new world

If you also want to match the other order (world ? ? ? hello), you need to add a second, symmetric pattern to your matcher.
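
To generalize this to sets of words, as in the forest/fire example from the question, one option is to generate both symmetric patterns with a small helper. The sketch below is only an illustration: `window_patterns` is a hypothetical helper name, not part of spaCy, and it assumes the window counts every token (punctuation included) from the first matched word to the last. It relies on the `IN` set-membership operator that the Matcher supports for token attributes.

```python
def window_patterns(first_words, second_words, window=5):
    """Build two Matcher patterns: a word from one set followed, within
    `window` tokens, by a word from the other set, in either order.

    Hypothetical helper for illustration; it only builds plain dicts and lists.
    """
    if window < 2:
        raise ValueError("window must cover at least the two matched words")
    # K-2 optional wildcards: an empty token spec matches any token,
    # and OP "?" makes that token optional.
    gap = [{"OP": "?"} for _ in range(window - 2)]
    first = {"LOWER": {"IN": sorted(w.lower() for w in first_words)}}
    second = {"LOWER": {"IN": sorted(w.lower() for w in second_words)}}
    return [[first, *gap, second], [second, *gap, first]]


patterns = window_patterns(
    ["forest", "tree", "nature"],
    ["fire", "burn", "overheat"],
    window=5,
)
# spaCy v3: matcher.add("NatureFire", patterns)
# spaCy v2: matcher.add("NatureFire", None, *patterns)
```

Note that LOWER compares the lowercased token text exactly, so "burning" would not match "burn" here. Matching on lemmas instead would need a pipeline that provides them.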
