JFerro

Reputation: 3433

Python spaCy: looking for two (or more) words within a window

I am trying to identify concepts in texts. I often consider a concept to be present in a text when two or more words appear relatively close to each other. For instance, a concept would be any of the words forest, tree, nature appearing fewer than 4 words away from any of fire, burn, overheat.

I am learning spaCy and so far I can use the matcher like this:

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# spaCy v3 signature: add(key, patterns); in v2 this was add(key, None, *patterns)
matcher.add("HelloWorld", [[{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}], [{"LOWER": "hello"}, {"LOWER": "world"}]])

That matches "hello world" and "hello, world" (or, for the example above, something like "tree fire").

I am looking for a solution that would yield matches of the words Hello and World within a window of 5 words.

I had a look into: https://spacy.io/usage/rule-based-matching

and the operators described there, but I cannot express this word-window approach in spaCy syntax.

I also cannot see how to generalize it to more than two words.

Any ideas? Thanks.

Upvotes: 3

Views: 3802

Answers (2)

Inspector6

Reputation: 576

I'm relatively new to spaCy, but I think the following pattern should work for any number of tokens between "hello" and "world", as long as those tokens consist of ASCII characters:

[{"LOWER": "hello"}, {"IS_ASCII": True, "OP": "*"}, {"LOWER": "world"}]

I tested it using Explosion's rule-based match explorer and it works. Overlapping matches will return just one match (e.g., "hello and I do mean hello world").
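Note that `*` places no upper bound on the gap, so this pattern on its own does not enforce a window. One possible way to add the limit is to filter the matches by span length afterwards. The following is only a sketch: it assumes the matches arrive as `(match_id, start, end)` tuples, which is what the Matcher returns, and the helper name `within_window` is hypothetical.

```python
def within_window(matches, window):
    """Keep only matches whose span covers at most `window` tokens.

    `matches` is an iterable of (match_id, start, end) tuples, where
    `end` is exclusive, so the span length in tokens is end - start.
    """
    return [
        (match_id, start, end)
        for match_id, start, end in matches
        if end - start <= window
    ]


# Hand-written tuples standing in for the output of matcher(doc)
example_matches = [(1, 0, 4), (1, 0, 9)]
print(within_window(example_matches, 5))
```

With real data you would pass `matcher(doc)` instead of the hand-written list.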

Upvotes: 2

David Dale

Reputation: 11424

For a window of K words, where K is relatively small, you can add K-2 optional wildcard tokens between your two words (the two words themselves take up the other two positions). A wildcard matches any token, and in spaCy terms it is just an empty dict. Optional means the token may or may not be there, and in spaCy it is encoded as {"OP": "?"}.

Thus, you can write your matcher as

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# spaCy v3 signature: add(key, patterns); in v2 this was add(key, None, *patterns)
matcher.add("HelloWorld", [[{"LOWER": "hello"}, {"OP": "?"}, {"OP": "?"}, {"OP": "?"}, {"LOWER": "world"}]])

which means you look for "hello", then 0 to 3 tokens of any kind, then "world", i.e. a window of 5 tokens. For example, for

doc = nlp(u"Hello brave new world")
for match_id, start, end in matcher(doc):
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)

it will print

15578876784678163569 HelloWorld 0 4 Hello brave new world

If you also want to match the other order (world ? ? ? hello), add a second, symmetric pattern to your matcher.
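To generalize this to sets of words, as in the forest/fire example from the question, the patterns can be generated rather than written by hand. The sketch below is one way to do it, not the only one: the function name `build_window_patterns` is hypothetical, and it relies on the `IN` set-membership syntax for token attributes that spaCy supports in pattern dicts. It only builds plain lists of dicts, so it runs without spaCy installed.

```python
def build_window_patterns(first_words, second_words, window):
    """Build Matcher patterns for one word from each set within `window` tokens.

    Returns two patterns, one per word order. Each pattern has
    window - 2 optional wildcard tokens between the two word slots.
    """
    if window < 2:
        raise ValueError("window must cover at least the two matched words")
    # K-2 optional wildcards: an empty attribute set plus OP "?"
    gap = [{"OP": "?"} for _ in range(window - 2)]
    first = {"LOWER": {"IN": sorted(w.lower() for w in first_words)}}
    second = {"LOWER": {"IN": sorted(w.lower() for w in second_words)}}
    return [[first, *gap, second], [second, *gap, first]]


patterns = build_window_patterns(
    ["forest", "tree", "nature"], ["fire", "burn", "overheat"], window=5
)
for pattern in patterns:
    print(pattern)
```

The result can then be registered with `matcher.add("ForestFire", patterns)` in spaCy v3, where the key "ForestFire" is just an illustrative name. Keep in mind that `LOWER` compares the lowercased token text exactly, so inflected forms such as "burning" have to be listed explicitly.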

Upvotes: 3
