Reputation: 365
I want to use spaCy's dependency parser to determine the scope of negation within my docs. I ran the dependency visualizer on the following string:
RT @trader $AAPL 2012 is ooopen to Talk about patents with GOOG definitely not the treatment Samsung got heh someURL
I am able to detect negation cues with:
negation_tokens = [tok for tok in doc if tok.dep_ == 'neg']
As a result, I see that "not" is the negation modifier of "got" in my string. Now I want to define the scope of the negation with the following:
negation_head_tokens = [token.head for token in negation_tokens]
for token in negation_head_tokens:
    end = token.i
    start = token.head.i + 1
    negated_tokens = doc[start:end]
    print(negated_tokens)
This gives the following output:
ooopen to Talk about patents with GOOG definitely not the treatment Samsung
Now that I have defined the scope, I want to prepend "not" to certain words, depending on their POS tag:
list = ['ADJ', 'ADV', 'AUX', 'VERB']
for token in negated_tokens:
    for i in list:
        if token.pos_ == i:
            print('not'+token.text)
This gives the following:
notooopen, notTalk, notdefinitely, notnot
I want to exclude notnot from my output and return
RT @trader $AAPL 2012 is notooopen to notTalk about patents with GOOG notdefinitely the treatment Samsung got heh someurl
How can I achieve this? And do you see any improvements to my script from a speed perspective?
Full script:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u'RT @trader $AAPL 2012 is ooopen to Talk about patents with GOOG definitely not the treatment Samsung got heh someURL')
list = ['ADJ', 'ADV', 'AUX', 'VERB']
negation_tokens = [tok for tok in doc if tok.dep_ == 'neg']
negation_head_tokens = [token.head for token in negation_tokens]
for token in negation_head_tokens:
    end = token.i
    start = token.head.i + 1
    negated_tokens = doc[start:end]
for token in negated_tokens:
    for i in list:
        if token.pos_ == i:
            print('not'+token.text)
Upvotes: 1
Views: 1344
Reputation: 11657
It's bad form to override Python built-ins like list, so I renamed it pos_list.
Since "not" is just a regular adverb, it seems the simplest way to avoid it would be with an explicit blacklist. Maybe there is a more "linguistic" way to do it.
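One possibly more "linguistic" option is to skip tokens by their dependency label instead of by their text, since the cue was already found through dep_ == 'neg'. Here is a minimal sketch of that idea. The names negate_by_dep and fake are made up for illustration, and the stand-in objects (with POS and dependency tags I assumed) only mimic the text, pos_ and dep_ attributes of spaCy tokens so the snippet runs without a model:

```python
from types import SimpleNamespace

NEGATABLE_POS = {"ADJ", "ADV", "AUX", "VERB"}


def negate_by_dep(tokens, pos_tags=NEGATABLE_POS):
    """Prefix 'not' to tokens with a matching POS tag, skipping negation cues."""
    out = []
    for tok in tokens:
        # The cue itself carries the 'neg' dependency label, so it is left alone
        # even when its POS tag is in pos_tags.
        if tok.pos_ in pos_tags and tok.dep_ != "neg":
            out.append("not" + tok.text)
        else:
            out.append(tok.text)
    return out


def fake(text, pos, dep):
    # Hypothetical stand-in for a spaCy Token; the tags below are assumed.
    return SimpleNamespace(text=text, pos_=pos, dep_=dep)


scope = [
    fake("definitely", "ADV", "advmod"),
    fake("not", "ADV", "neg"),
    fake("the", "DET", "det"),
    fake("treatment", "NOUN", "dobj"),
]
print(negate_by_dep(scope))
# ['notdefinitely', 'not', 'the', 'treatment']
```

With real spaCy objects you should be able to pass the negated_tokens span directly, since the function only reads text, pos_ and dep_.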
I slightly sped up your inner loop.
Code:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u'RT @trader $AAPL 2012 is ooopen to Talk about patents with GOOG definitely not the treatment Samsung got heh someURL')
pos_list = ['ADJ', 'ADV', 'AUX', 'VERB']
negation_tokens = [tok for tok in doc if tok.dep_ == 'neg']
blacklist = [token.text for token in negation_tokens]
negation_head_tokens = [token.head for token in negation_tokens]
new_doc = []
# determine the negation scope (the last negation found wins)
for token in negation_head_tokens:
    end = token.i
    start = token.head.i + 1
    negated_tokens = doc[start:end]
# rebuild the text, prefixing 'not' to in-scope tokens with a matching POS tag
for token in doc:
    if token in negated_tokens:
        if token.pos_ in pos_list and token.text not in blacklist:
            # or you can leave out the blacklist and put it here directly
            # if token.pos_ in pos_list and token.text not in [token.text for token in negation_tokens]:
            new_doc.append('not'+token.text)
            continue
    new_doc.append(token.text)
print(' '.join(new_doc))
> RT @trader $ AAPL 2012 is notooopen to notTalk about patents with GOOG notdefinitely not the treatment Samsung got heh someURL
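Note that this output still contains the original "not", while the output requested in the question drops it. If you want that, and want to avoid checking span membership for every token, one option is to compare token indices against the scope boundaries. This is only a sketch under assumptions: rewrite_negation, drop_cue and fake are names I invented, the scope is the half-open range [start, end) computed as above, and the stand-in objects (with tags I assumed) mimic the i, text, pos_ and dep_ attributes of spaCy tokens so it runs without a model:

```python
from types import SimpleNamespace

NEGATABLE_POS = frozenset({"ADJ", "ADV", "AUX", "VERB"})


def rewrite_negation(tokens, start, end, pos_tags=NEGATABLE_POS, drop_cue=True):
    """Join token texts, prefixing 'not' to in-scope tokens with a matching POS tag.

    tokens: iterable of objects exposing .i, .text, .pos_ and .dep_
    start, end: half-open scope [start, end) expressed as token indices
    drop_cue: if True, the negation cue inside the scope is omitted
    """
    words = []
    for tok in tokens:
        in_scope = start <= tok.i < end  # index comparison, no span scan
        if in_scope and tok.dep_ == "neg":
            if not drop_cue:
                words.append(tok.text)
            continue
        if in_scope and tok.pos_ in pos_tags:
            words.append("not" + tok.text)
        else:
            words.append(tok.text)
    return " ".join(words)


def fake(i, text, pos, dep):
    # Hypothetical stand-in for a spaCy Token; the tags below are assumed.
    return SimpleNamespace(i=i, text=text, pos_=pos, dep_=dep)


tokens = [
    fake(0, "is", "AUX", "ROOT"),
    fake(1, "ooopen", "ADJ", "acomp"),
    fake(2, "definitely", "ADV", "advmod"),
    fake(3, "not", "ADV", "neg"),
    fake(4, "the", "DET", "det"),
    fake(5, "treatment", "NOUN", "dobj"),
    fake(6, "got", "VERB", "relcl"),
]
print(rewrite_negation(tokens, start=1, end=6))
# is notooopen notdefinitely the treatment got
```

With a real pipeline, something like rewrite_negation(doc, start, end) should behave the same way, since iterating a Doc yields tokens that expose these attributes. Using a set for the POS tags also makes the membership test independent of how many tags you list.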
Upvotes: 2