Ivo

Reputation: 4200

spacy matcher: detect first word of sentence if pattern is matched

My spaCy Matcher shows unexpected behaviour and I cannot figure out why. Consider the following toy data:

# %% Load packages
import pandas as pd
import spacy

# %% toy data
df = \
    pd.DataFrame(columns=['date',         'ground_cat',   'ground_word',    'sentence'],
                  data = [["2009-09-01",  "a",            'wutschäumend'  , "Wutschäumend bin ich."], # in this line, "Wutschäumend" should match
                          ["2009-09-01",  "a",            'wutschäumend',   "Ich bin Wutschäumend."],
                          ["2009-09-01",  "neg_a",        'wutschäumend'  , "Ich bin nicht wutschäumend."],
                          ["2009-09-01",  "b",            'zweifelhaftes' , "Peter hat ein zweifelhaftes Verständnis von Gerechtigkeit."],
                          ["2009-09-01",  "c",            'unsittlich',     "Das ist unsittlich."],
                          ["2009-09-01",  "d",            'unsolidarisch' , "Niemand ist so unsolidarisch wie er."]])

# note: nlp is loaded in the "Prepare Matcher" cell further down; run that cell first
df['processed_sentence'] = [doc for doc in nlp.pipe(df['sentence'].tolist())]

The ground_cat and ground_word columns hold the ground truth: in the first row, for example, category a should be matched by finding the word wutschäumend.

I now prepare the matcher and instantiate patterns. Basically, I want the words in matching_dict to match if they are either at the beginning of the sentence or if they are somewhere in the sentence but not preceded by one of the negation words.

These are the patterns:

# %% Prepare Matcher
nlp = spacy.load("de_core_news_lg")
matcher = spacy.matcher.Matcher(nlp.vocab)  # instantiate Matcher

negations = ["nicht", "nichts", "kein", "keine", "keinen", "keinem"] # negation words

matching_dict: dict = {"a": ['wutschäumend'],
                       "b": ['zweifelhaftes'],
                       "c": ["unsittlich"],
                       "d": ["unsolidarisch"]}

# patterns for non-negated words associated with each emotion
a = [[{"IS_SENT_START": True}, {"LOWER": {"IN": matching_dict['a']}}], [{"LOWER": {"NOT_IN": negations}}, {"LOWER": {"IN": matching_dict['a']}}]]
b = [[{"IS_SENT_START": True}, {"LOWER": {"IN": matching_dict['b']}}], [{"LOWER": {"NOT_IN": negations}}, {"LOWER": {"IN": matching_dict['b']}}]]
c = [[{"IS_SENT_START": True}, {"LOWER": {"IN": matching_dict['c']}}], [{"LOWER": {"NOT_IN": negations}}, {"LOWER": {"IN": matching_dict['c']}}]]
d = [[{"IS_SENT_START": True}, {"LOWER": {"IN": matching_dict['d']}}], [{"LOWER": {"NOT_IN": negations}}, {"LOWER": {"IN": matching_dict['d']}}]]

matcher.add(201, a)
matcher.add(202, b)
matcher.add(203, c)
matcher.add(204, d)

Now, when I apply this to the toy data, the first sentence is not matched although it should be, and I cannot figure out what is wrong with my pattern. Can someone point out my mistake?

df['matches'] = df['processed_sentence'].apply(matcher)  # match patterns

df['matches']
# 0               [] # should be [(201, 0, 2)]!
# 1    [(201, 1, 3)]
# 2               []
# 3    [(202, 2, 4)]
# 4    [(203, 1, 3)]
# 5    [(204, 2, 4)]
# Name: matches, dtype: object

Thanks a lot in advance!

Upvotes: 2

Views: 806

Answers (1)

polm23

Reputation: 15633

Let's look at these patterns. Keep in mind each dictionary is a token.

a = [
    [{"IS_SENT_START": True}, 
     {"LOWER": {"IN": matching_dict['a']}}], 
    [{"LOWER": {"NOT_IN": negations}}, 
     {"LOWER": {"IN": matching_dict['a']}}]]

There are two patterns here.

In the first, you have the first word of the sentence, and the second word is in your a list.

In the second, you have a word that isn't a negation, followed by a word in the a list.
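To see what the first pattern actually requires, here is a minimal sketch. It assumes a blank German pipeline with a rule-based sentencizer in place of de_core_news_lg (so no model download is needed), and the helper name spans is only for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: blank German pipeline plus sentencizer instead of de_core_news_lg.
nlp = spacy.blank("de")
nlp.add_pipe("sentencizer")

matcher = Matcher(nlp.vocab)
# The first pattern from the question: two dictionaries, so two tokens.
matcher.add("first_pattern", [[{"IS_SENT_START": True},
                               {"LOWER": {"IN": ["wutschäumend"]}}]])

def spans(text):
    """Hypothetical helper: return (start, end) token offsets of all matches."""
    return [(start, end) for _, start, end in matcher(nlp(text))]

# The listed word is the first token, so nothing precedes it to fill slot one.
word_first = spans("Wutschäumend bin ich.")
# The listed word is the second token, right after the sentence-initial token.
word_second = spans("Total wutschäumend bin ich.")
print(word_first, word_second)
```

Under these assumptions, only the second sentence should produce a match, because the pattern needs a sentence-initial token followed by the listed word.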

So your first pattern requires two tokens and cannot match a listed word that is itself the first word of the sentence, which is what you want it to do. Each dictionary describes one token, so both conditions have to go into a single dictionary, like this:

[{"IS_SENT_START": True, "LOWER": {"IN": matching_dict['a']}}]
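A minimal sketch of the corrected pattern set applied to the toy sentences for category a. As an assumption, it uses a blank German pipeline with a sentencizer rather than de_core_news_lg, a string key "a" instead of 201, and a hypothetical helper match_spans.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: blank German pipeline plus sentencizer instead of de_core_news_lg.
nlp = spacy.blank("de")
nlp.add_pipe("sentencizer")

negations = ["nicht", "nichts", "kein", "keine", "keinen", "keinem"]
words_a = ["wutschäumend"]

patterns_a = [
    # One token that is both sentence-initial and in the word list.
    [{"IS_SENT_START": True, "LOWER": {"IN": words_a}}],
    # Two tokens: a non-negation word followed by a word from the list.
    [{"LOWER": {"NOT_IN": negations}}, {"LOWER": {"IN": words_a}}],
]

matcher = Matcher(nlp.vocab)
matcher.add("a", patterns_a)

def match_spans(text):
    """Hypothetical helper: return (start, end) token offsets of all matches."""
    return [(start, end) for _, start, end in matcher(nlp(text))]

first = match_spans("Wutschäumend bin ich.")
second = match_spans("Ich bin Wutschäumend.")
negated = match_spans("Ich bin nicht wutschäumend.")
print(first, second, negated)
```

Note that the sentence-initial match covers a single token, so with this sketch the first sentence should come back as the span (0, 1) rather than the (0, 2) expected in the question. The negated sentence should still return no match.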

Upvotes: 1
