How to search a text for compound phrases whose words may be separated, in Python?

Question

Assume I have a text and want to check whether it contains some compound phrase, including cases where the words of the phrase do not directly follow each other.

For example, if you want to check whether a text is about firefighters, then a text like this

text = "currently there are over 4000 people involved in fighting the rapidly growing fires in Australia"

should also yield a positive result. (I actually want to apply this to German, where the examples may be less artificial.)

I have no expertise in NLP, so maybe there is a clever way to do this and I just do not know the correct term to search for. Of course, if the text is not too large, one could do the following exhaustive search over all 2-word combinations:

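import itertools
import spacy

nlp = spacy.load("en_core_web_sm")  # or any other installed model, e.g. a German one
doc = nlp(text)
# Keep only the lemmas of content tokens (no punctuation, stop words or digits)
wordlist = [t.lemma_ for t in doc if not (t.is_punct or t.is_stop or t.is_digit)]

# Build every unordered pair of lemmas (quadratic in the number of tokens)
combs = itertools.combinations(wordlist, 2)
comb_set = [set(c) for c in combs]

{'fire', 'fight'} in comb_set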

But I was thinking that there might be a more efficient way to do this.

David Dale · Accepted Answer

If you just want to check that the lemmas "fire" and "fight" are both present in the text, then instead of explicitly generating all the combinations (quadratic complexity), you can check that both lemmas belong to the set of all lemmas (linear complexity):

# !python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm')
text = "currently there are over 4000 people involved in fighting the rapidly growing fires in Australia"
doc = nlp(text)
lemmas = {token.lemma_ for token in doc}
print('fire' in lemmas and 'fight' in lemmas) # True
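
The same membership test extends to phrases with more than two parts. Here is a minimal sketch of that idea; the helper name `contains_all_lemmas` and the hand-written lemma list are assumptions for illustration (not model output), so that the snippet runs without spaCy:

```python
def contains_all_lemmas(lemmas, required):
    """Return True if every lemma in `required` occurs in `lemmas`."""
    # Building the set takes one pass over the text;
    # issubset then needs one lookup per required lemma.
    return set(required).issubset(set(lemmas))


# Hand-written lemmas for the example sentence (assumed, not produced by a model)
example_lemmas = [
    "currently", "there", "be", "over", "4000", "people", "involve", "in",
    "fight", "the", "rapidly", "grow", "fire", "in", "Australia",
]

print(contains_all_lemmas(example_lemmas, ["fire", "fight"]))            # True
print(contains_all_lemmas(example_lemmas, ["fire", "fight", "forest"]))  # False
```

With a real pipeline you would pass `(token.lemma_ for token in doc)` as the first argument.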

You may also want to check that the words "fire" and "fight" are directly related to each other, so that your rule doesn't fire on a text like "I light the fire and watch the smoke fight with the mosquitoes".

You can achieve this by checking that the word "fight" is the syntactic head of the word "fire". This test is also linear in complexity (if the syntactic parser is linear, as in spaCy), so it should scale well to large texts.

def check_phrase(text, head, child):
    return any((t.lemma_ == child and t.head.lemma_ == head) for t in nlp(text))

text = "currently there are over 4000 people involved in fighting the rapidly growing fires in Australia"
print(check_phrase(text, 'fight', 'fire'))  # True

another_text = "I light the fire and watch the smoke fight with the mosquitoes"
print(check_phrase(another_text, 'fight', 'fire'))  # False
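
Note that the direct-head test only matches when "fire" attaches immediately to "fight". If the words can be linked through an intermediate token (for instance a preposition), one possible relaxation is to walk up the chain of heads instead. The sketch below is an assumption-laden illustration, not part of the method above: `FakeToken` is a hypothetical stand-in that exposes only `lemma_` and `head`, the parse of "fight against the fire" is written by hand, and `has_ancestor_lemma` / `check_phrase_indirect` are made-up helper names.

```python
class FakeToken:
    """Hypothetical stand-in exposing only `lemma_` and `head`."""

    def __init__(self, lemma, head=None):
        self.lemma_ = lemma
        # Convention borrowed from spaCy: the root token is its own head.
        self.head = head if head is not None else self


def has_ancestor_lemma(token, head_lemma):
    """Walk up the head chain; True if any ancestor has lemma `head_lemma`."""
    while token.head != token:  # stop once the root is reached
        token = token.head
        if token.lemma_ == head_lemma:
            return True
    return False


def check_phrase_indirect(tokens, head, child):
    """True if some token with lemma `child` has an ancestor with lemma `head`."""
    return any(t.lemma_ == child and has_ancestor_lemma(t, head) for t in tokens)


# Hand-written parse (assumed): fire -> against -> fight
fight = FakeToken("fight")
against = FakeToken("against", head=fight)
fire = FakeToken("fire", head=against)
the = FakeToken("the", head=fire)
tokens = [fight, against, the, fire]

# The direct-head test misses this link, the ancestor walk finds it.
direct = any(t.lemma_ == "fire" and t.head.lemma_ == "fight" for t in tokens)
print(direct)                                          # False
print(check_phrase_indirect(tokens, "fight", "fire"))  # True
```

The trade-off is that a looser test also accepts more distant relations, so it is more likely to match texts where the two words are only loosely connected. On real spaCy documents, `Token.ancestors` provides the same upward walk.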
