Reputation: 4200
My spaCy Matcher shows unexpected behaviour and I cannot figure out why. Consider the following toy data:
# %% Load packages
import pandas as pd
import spacy
# %% toy data
df = \
pd.DataFrame(columns=['date', 'ground_cat', 'ground_word', 'sentence'],
data = [["2009-09-01", "a", 'wutschäumend' , "Wutschäumend bin ich."], # in this line, "Wutschäumend" should match
["2009-09-01", "a", 'wutschäumend', "Ich bin Wutschäumend."],
["2009-09-01", "neg_a", 'wutschäumend' , "Ich bin nicht wutschäumend."],
["2009-09-01", "b", 'zweifelhaftes' , "Peter hat ein zweifelhaftes Verständnis von Gerechtigkeit."],
["2009-09-01", "c", 'unsittlich', "Das ist unsittlich."],
["2009-09-01", "d", 'unsolidarisch' , "Niemand ist so unsolidarisch wie er."]])
nlp = spacy.load("de_core_news_lg")  # the pipeline must be loaded before nlp.pipe is called
df['processed_sentence'] = list(nlp.pipe(df['sentence'].tolist()))
The ground_x columns (ground_cat and ground_word) identify the ground truth: e.g. in row one, category a should be matched by finding the word wutschäumend, and so on.
I now prepare the matcher and instantiate the patterns. Basically, I want the words in matching_dict to match if they are either at the beginning of the sentence, or somewhere else in the sentence but not preceded by one of the negation words.
These are the patterns:
# %% Prepare Matcher
nlp = spacy.load("de_core_news_lg")
matcher = spacy.matcher.Matcher(nlp.vocab) # instantiate Matcher
negations = ["nicht", "nichts", "kein", "keine", "keinen", "keinem"] # negation words
matching_dict: dict = {"a": ['wutschäumend'],
"b": ['zweifelhaftes'],
"c": ["unsittlich"],
"d": ["unsolidarisch"]}
# patterns for non-negated words associated with each emotion
a = [[{"IS_SENT_START": True}, {"LOWER": {"IN": matching_dict['a']}}], [{"LOWER": {"NOT_IN": negations}}, {"LOWER": {"IN": matching_dict['a']}}]]
b = [[{"IS_SENT_START": True}, {"LOWER": {"IN": matching_dict['b']}}], [{"LOWER": {"NOT_IN": negations}}, {"LOWER": {"IN": matching_dict['b']}}]]
c = [[{"IS_SENT_START": True}, {"LOWER": {"IN": matching_dict['c']}}], [{"LOWER": {"NOT_IN": negations}}, {"LOWER": {"IN": matching_dict['c']}}]]
d = [[{"IS_SENT_START": True}, {"LOWER": {"IN": matching_dict['d']}}], [{"LOWER": {"NOT_IN": negations}}, {"LOWER": {"IN": matching_dict['d']}}]]
matcher.add(201, a)
matcher.add(202, b)
matcher.add(203, c)
matcher.add(204, d)
Now, when I apply this to the toy data, the first sentence is not matched although it should be, and I cannot figure out what is wrong with my pattern. Can someone point out my mistake?
df['matches'] = df['processed_sentence'].apply(matcher) # match patterns
df['matches']
# 0 [] # should be [(201, 0, 2)]!
# 1 [(201, 1, 3)]
# 2 []
# 3 [(202, 2, 4)]
# 4 [(203, 1, 3)]
# 5 [(204, 2, 4)]
# Name: matches, dtype: object
Thanks a lot in advance!
Upvotes: 2
Views: 806
Reputation: 15633
Let's look at these patterns. Keep in mind that each dictionary describes exactly one token.
a = [
[{"IS_SENT_START": True},
{"LOWER": {"IN": matching_dict['a']}}],
[{"LOWER": {"NOT_IN": negations}},
{"LOWER": {"IN": matching_dict['a']}}]]
There are two patterns here.
The first matches two tokens: the first word of a sentence, followed by a second word that is in your a list.
The second also matches two tokens: a word that isn't a negation, followed by a word in the a list.
So your first pattern never matches when the word from your list is itself the first word of the sentence, which is what you want it to do. A dictionary describes one token, so both conditions have to go into the same dictionary. It should look like this:
[{"IS_SENT_START": True, "LOWER": {"IN": matching_dict['a']}}]
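To check this end to end, here is a minimal sketch of the corrected pattern for category a. It makes a few assumptions that are not in the question: a blank German pipeline with the rule-based sentencizer stands in for de_core_news_lg (so no model download is needed), the key is the string "a" instead of 201, and find_matches is a hypothetical helper added only for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: blank German pipeline + sentencizer instead of de_core_news_lg.
# The sentencizer sets sentence starts, which IS_SENT_START relies on.
nlp = spacy.blank("de")
nlp.add_pipe("sentencizer")

negations = ["nicht", "nichts", "kein", "keine", "keinen", "keinem"]
words = ["wutschäumend"]

patterns = [
    # One token that is both sentence-initial and in the word list.
    [{"IS_SENT_START": True, "LOWER": {"IN": words}}],
    # Two tokens: a non-negation token followed by a word from the list.
    [{"LOWER": {"NOT_IN": negations}}, {"LOWER": {"IN": words}}],
]

matcher = Matcher(nlp.vocab)
matcher.add("a", patterns)  # string key used here for readability (assumption)


def find_matches(text):
    """Hypothetical helper: return (label, start, end) token spans for text."""
    doc = nlp(text)
    return [
        (nlp.vocab.strings[match_id], start, end)
        for match_id, start, end in matcher(doc)
    ]


for sentence in [
    "Wutschäumend bin ich.",
    "Ich bin Wutschäumend.",
    "Ich bin nicht wutschäumend.",
]:
    print(sentence, find_matches(sentence))
```

Note that the match for the first sentence covers a single token, so with your integer keys the expected result would be (201, 0, 1) rather than (201, 0, 2). The negated sentence still produces no match.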
Upvotes: 1