Akbar Hussein

Reputation: 360

Spacy matcher with regex across tokens

I have the following sentences:

phrases = ['children externalize their emotions through outward behavior',
         'children externalize hidden emotions.',
         'children externalize internalized emotions.',
         'a child might externalize a hidden emotion through misbehavior',
         'a kid might externalize some emotions through behavior',
         'traumatized children externalize their hidden trauma through bad behavior.',
         'The kid is externalizing internal traumas',
         'A child might externalize emotions though his outward behavior',
         'The kid externalized a lot of his emotions through misbehavior.']

I want to catch whatever noun comes after the verb externalize (externalizing, externalizes, etc.)

In this case, we should get:

externalize their emotions
externalize hidden emotions
externalize internalized emotions
externalize a hidden emotion
externalize some emotions
externalize their hidden trauma
externalizing internal traumas
externalized a lot of his emotions

So far I am only able to catch the noun if it comes immediately after the verb externalize.

I want to catch the noun if it appears within 15 characters after the verb. For example, 'externalize a lot of his emotions' should be matched, because 'a lot of his' is only 14 characters, counting the spaces.

Here is my working code, which is far from perfect.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(vocab=nlp.vocab)
# match any verb immediately followed by a noun
verb_noun = [{'POS': 'VERB'}, {'POS': 'NOUN'}]
matcher.add('verb_noun', [verb_noun])

list_result = []
for phrase in phrases:
    doc = nlp(phrase)
    for match_id, start, end in matcher(doc):
        result = [token.lemma_ for token in doc[start:end]]
        # keep only matches whose verb is a form of "externalize"
        if 'externaliz' in result[0].lower():
            list_result.append(' '.join(result))

Upvotes: 1

Views: 403

Answers (1)

polm23

Reputation: 15633

I want to catch the noun if it appears within 15 characters after the verb. For example, 'externalize a lot of his emotions' should be matched, because 'a lot of his' is only 14 characters, counting the spaces.

You can do this, though I wouldn't recommend it. To do it, you would write a regex to match against the string and use Doc.char_span to turn the character offsets into a Span. Since the Matcher works on tokens, a character-based heuristic like "within 15 characters, including spaces" can't be expressed in its patterns. That kind of heuristic is also a hack and will behave erratically.
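For completeness, here's a minimal sketch of that character-window hack using plain re (no spaCy required for the matching step): find the verb, allow a gap of at most 15 characters, and capture the next word. The match offsets could then be handed to Doc.char_span. Note how fragile this is; on other sentences the greedy gap can swallow a preposition or overshoot entirely, which is exactly why I don't recommend it:

```python
import re

# one of the example sentences from the question
text = "The kid externalized a lot of his emotions through misbehavior."

# "externaliz..." then a gap of at most 15 characters, then the next word;
# the \s before the capture keeps \w+ on a whole word, not a word suffix
pattern = re.compile(r"\bexternaliz\w*.{0,15}\s(\w+)")

m = pattern.search(text)
print(m.group(0))  # externalized a lot of his emotions
print(m.group(1))  # emotions  <- the candidate noun
```

On a parsed doc, doc.char_span(m.start(), m.end()) would convert those offsets into a token Span (it returns None if the offsets don't align with token boundaries; alignment_mode="expand" relaxes that).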

I suspect what you actually want to do is figure out what is being externalized, that is, to find the object of the verb. In that case you should use the DependencyMatcher. Here's an example of using it with a simple rule and merging noun chunks:

import spacy

from spacy.matcher import DependencyMatcher
nlp = spacy.load("en_core_web_sm")

texts = ['children externalize their emotions through outward behavior',
         'children externalize hidden emotions.',
         'children externalize internalized emotions.',
         'a child might externalize a hidden emotion through misbehavior',
         'a kid might externalize some emotions through behavior',
         'traumatized children externalize their hidden trauma through bad behavior.',
         'The kid is externalizing internal traumas',
         'A child might externalize emotions though his outward behavior',
         'The kid externalized a lot of his emotions through misbehavior.']

pattern = [
  {
    "RIGHT_ID": "externalize",
    "RIGHT_ATTRS": {"LEMMA": "externalize"}
  },
  {
    "LEFT_ID": "externalize",
    "REL_OP": ">",
    "RIGHT_ID": "object",
    "RIGHT_ATTRS": {"DEP": "dobj"}
  },
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("EXTERNALIZE", [pattern])

# optional: merge noun phrases into single tokens so the full chunk prints
nlp.add_pipe("merge_noun_chunks")

# what was externalized?
for doc in nlp.pipe(texts):
    for match_id, token_ids in matcher(doc):
        # token_ids[0] is the verb ("externalize"), token_ids[1] is its object
        print(doc[token_ids[1]])

Output:

their emotions
hidden emotions
internalized emotions
a hidden emotion
some emotions
their hidden trauma
internal traumas
emotions
his outward behavior
a lot

Upvotes: 1
