Reputation: 596
The Stanford Dependencies manual describes "Stanford typed dependencies", and in particular the type "neg" (negation modifier). This type also shows up when using the Stanford Enhanced++ parser on the website. For example, for the sentence:
"Barack Obama was not born in Hawaii"
the parser does indeed find neg(born,not).
But when I use the stanfordnlp Python library, the only dependency parser I can get parses the sentence as follows:
('Barack', '5', 'nsubj:pass')
('Obama', '1', 'flat')
('was', '5', 'aux:pass')
('not', '5', 'advmod')
('born', '0', 'root')
('in', '7', 'case')
('Hawaii', '5', 'obl')
and the code that generates it:
import stanfordnlp
stanfordnlp.download('en')
nlp = stanfordnlp.Pipeline()
doc = nlp("Barack Obama was not born in Hawaii")
a = doc.sentences[0]
a.print_dependencies()
Is there a way to get results similar to the enhanced dependency parser, or to use any other Stanford parser that produces typed dependencies, so that I get the negation modifier?
Upvotes: 6
Views: 3842
Reputation: 739
An alternative is spaCy (https://spacy.io/api/dependencyparser):
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_lg
import spacy
nlp = spacy.load('en_core_web_lg')
def printInfo(doc):
    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.tag_,
              token.shape_, token.is_alpha,
              token.is_stop, token.ent_type_, token.dep_, token.head.text)
doc = nlp("Barack Obama was not born in Hawaii")
printInfo(doc)
and the output is:
Barack Barack PROPN NNP Xxxxx True False PERSON compound Obama
Obama Obama PROPN NNP Xxxxx True False PERSON nsubjpass born
was be AUX VBD xxx True True auxpass born
not not PART RB xxx True True neg born
born bear VERB VBN xxxx True False ROOT born
in in ADP IN xx True True prep born
Hawaii Hawaii PROPN NNP Xxxxx True False GPE pobj in
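Since spaCy labels the negation with dep_ == 'neg', picking out negated words comes down to a filter over the tokens. The following is only a minimal sketch: negated_heads is a hypothetical helper name, and the SimpleNamespace objects are stand-ins that mirror the output above so the snippet runs without spaCy installed. With spaCy, you would pass the real doc instead.

```python
from types import SimpleNamespace

def negated_heads(doc):
    """Return (negation_text, head_text) for every token labelled 'neg'."""
    return [(tok.text, tok.head.text) for tok in doc if tok.dep_ == "neg"]

# Stand-in tokens (assumption: they only imitate the .text, .dep_ and .head
# attributes that the spaCy tokens above expose).
born = SimpleNamespace(text="born", dep_="ROOT")
born.head = born  # the root is its own head, as in the output above
fake_doc = [
    SimpleNamespace(text="was", dep_="auxpass", head=born),
    SimpleNamespace(text="not", dep_="neg", head=born),
    born,
]

print(negated_heads(fake_doc))  # [('not', 'born')]
```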
Upvotes: 1
Reputation: 566
Year 2021:
NOTE: Run this code from a terminal. It won't work from a notebook because of some stdin compatibility issues.
import os
os.environ["CORENLP_HOME"] = "./stanford-corenlp-4.2.0"
import pandas as pd
from stanza.server import CoreNLPClient
Upvotes: 1
Reputation: 41
# set up the client
with CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','ner', 'depparse'], timeout=60000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)
    offset = 0 # keeps track of token offset for each sentence
    for sentence in ann.sentence:
        print('___________________')
        print('dependency parse:')
        # extract dependency parse
        dp = sentence.basicDependencies
        # build a helper dict to associate token index and label
        token_dict = {sentence.token[i].tokenEndIndex-offset : sentence.token[i].word for i in range(0, len(sentence.token))}
        offset += len(sentence.token)
        # build list of (source, target) pairs
        out_parse = [(dp.edge[i].source, dp.edge[i].target) for i in range(0, len(dp.edge))]
        for source, target in out_parse:
            print(source, token_dict[source], '->', target, token_dict[target])
        print('\nTokens \t POS \t NER')
        for token in sentence.token:
            print(token.word, '\t', token.pos, '\t', token.ner)
This outputs the following for the first sentence:
___________________
dependency parse:
2 Obama -> 1 Barack
4 born -> 2 Obama
4 born -> 3 was
4 born -> 6 Hawaii
4 born -> 7 .
6 Hawaii -> 5 in
Tokens POS NER
Barack NNP PERSON
Obama NNP PERSON
was VBD O
born VBN O
in IN O
Hawaii NNP STATE_OR_PROVINCE
. . O
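The offset bookkeeping in the loop is easy to miss. As far as I can tell, the code assumes that tokenEndIndex keeps counting across the whole document, while the source and target indices of the edges restart at 1 in every sentence. The sketch below replays that arithmetic on plain lists; the two sentences and the local_index_maps name are made up for illustration and nothing here talks to CoreNLP.

```python
# Made-up example data standing in for the tokens of ann.sentence.
sentences = [["Obama", "was", "born", "."], ["He", "left", "."]]

def local_index_maps(sentences):
    """Build one {sentence-local 1-based index: word} dict per sentence."""
    maps = []
    offset = 0  # number of tokens seen in earlier sentences
    for words in sentences:
        # Assumed document-level end index of the i-th token: offset + i + 1.
        doc_end_indices = [offset + i + 1 for i in range(len(words))]
        # Subtracting the offset brings the index back to 1..len(words).
        maps.append({end - offset: word
                     for end, word in zip(doc_end_indices, words)})
        offset += len(words)
    return maps

maps = local_index_maps(sentences)
print(maps[1])  # {1: 'He', 2: 'left', 3: '.'}
```

Without the subtraction, the second sentence would be keyed 5 to 7 and the lookups by edge index would fail.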
Upvotes: 4
Reputation: 126
Note that the Python library stanfordnlp is not just a Python wrapper for Stanford CoreNLP.
As stated in the stanfordnlp GitHub repo:
The Stanford NLP Group's official Python NLP library. It contains packages for running our latest fully neural pipeline from the CoNLL 2018 Shared Task and for accessing the Java Stanford CoreNLP server.
stanfordnlp contains a new set of neural network models, trained for the CoNLL 2018 shared task. The online parser is based on the CoreNLP 3.9.2 Java library. Those are two different pipelines and sets of models, as explained here.
Your code only accesses their neural pipeline trained on CoNLL 2018 data. This explains the differences you saw compared to the online version. Those are basically two different models.
I believe what adds to the confusion is that both repositories belong to the GitHub user named stanfordnlp (which is the team name). Don't confuse the Java stanfordnlp/CoreNLP with the Python stanfordnlp/stanfordnlp.
Concerning your 'neg' issue, it seems that the models in the Python library stanfordnlp annotate negation as a plain 'advmod'. At least that is what I ran into for a few example sentences. This would be consistent with Universal Dependencies v2, which dropped the separate 'neg' relation.
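If you want to stay with the neural pipeline, one possible workaround is to treat an 'advmod' whose dependent is a known negation word as a negation. The snippet below is only a rough sketch of that idea, working on the (word, head, relation) triples printed in the question. NEGATION_WORDS and find_negations are hypothetical names of mine, not part of stanfordnlp, and the word list is an incomplete assumption.

```python
# Hypothetical helper, not part of stanfordnlp.
NEGATION_WORDS = {"not", "n't", "never", "no"}  # assumed, incomplete list

def find_negations(triples):
    """Return (negation_word, governor_word) pairs.

    `triples` is a list of (word, head_index, relation) tuples, where
    head_index is a 1-based position given as a string and '0' is the root.
    """
    pairs = []
    for word, head, rel in triples:
        if rel == "advmod" and word.lower() in NEGATION_WORDS:
            governor = triples[int(head) - 1][0]
            pairs.append((word, governor))
    return pairs

# The parse printed in the question.
parse = [
    ('Barack', '5', 'nsubj:pass'),
    ('Obama', '1', 'flat'),
    ('was', '5', 'aux:pass'),
    ('not', '5', 'advmod'),
    ('born', '0', 'root'),
    ('in', '7', 'case'),
    ('Hawaii', '5', 'obl'),
]

print(find_negations(parse))  # [('not', 'born')]
```

A lexical filter like this will miss negations that are not in the list, so it is a stopgap rather than a replacement for a real 'neg' label.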
However, you can still access CoreNLP through the stanfordnlp package. It requires a few more steps, though. Quoting the GitHub repo:
There are a few initial setup steps.
- Download Stanford CoreNLP and models for the language you wish to use. (you can download CoreNLP and the language models here)
- Put the model jars in the distribution folder
- Tell the python code where Stanford CoreNLP is located: export CORENLP_HOME=/path/to/stanford-corenlp-full-2018-10-05
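The last step can also be done from inside Python by setting the environment variable before the client is created. Here is a small sketch of that, with a sanity check that the folder actually contains jar files. The corenlp_home_looks_valid helper is a made-up name, and the path is just the placeholder from the step above, so on most machines this will print the warning.

```python
import os
from pathlib import Path

def corenlp_home_looks_valid(path):
    """Return True if `path` is a directory containing at least one .jar file."""
    home = Path(path)
    return home.is_dir() and any(home.glob("*.jar"))

# Placeholder path copied from the setup step; replace it with your own.
corenlp_dir = "/path/to/stanford-corenlp-full-2018-10-05"

if corenlp_home_looks_valid(corenlp_dir):
    # Equivalent to `export CORENLP_HOME=...` for this Python process only.
    os.environ["CORENLP_HOME"] = corenlp_dir
else:
    print("No CoreNLP jars found in", corenlp_dir)
```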
Once that is done, you can start a client, with code that can be found in the demo:
from stanfordnlp.server import CoreNLPClient
# example sentence from the question
text = "Barack Obama was not born in Hawaii."
with CoreNLPClient(annotators=['tokenize','ssplit','pos','depparse'], timeout=60000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)
    # get the first sentence
    sentence = ann.sentence[0]
    # get the dependency parse of the first sentence
    print('---')
    print('dependency parse of first sentence')
    dependency_parse = sentence.basicDependencies
    print(dependency_parse)
    # get the tokens of the first sentence
    # note that 1 token is 1 node in the parse tree, nodes start at 1
    print('---')
    print('Tokens of first sentence')
    for token in sentence.token:
        print(token)
Your sentence will therefore be parsed if you specify the 'depparse' annotator (as well as the prerequisite annotators tokenize, ssplit, and pos). Reading the demo, it seems that we can only access basicDependencies. I have not managed to make Enhanced++ dependencies work via stanfordnlp.
But the negations will still appear if you use basicDependencies!
Here is the output I obtained using stanfordnlp and your example sentence. It is a DependencyGraph object, which is not pretty, but that is unfortunately always the case when we use the low-level CoreNLP tools. You will see that between nodes 4 and 5 ('not' and 'born') there is an edge 'neg'.
node {
sentenceIndex: 0
index: 1
}
node {
sentenceIndex: 0
index: 2
}
node {
sentenceIndex: 0
index: 3
}
node {
sentenceIndex: 0
index: 4
}
node {
sentenceIndex: 0
index: 5
}
node {
sentenceIndex: 0
index: 6
}
node {
sentenceIndex: 0
index: 7
}
node {
sentenceIndex: 0
index: 8
}
edge {
source: 2
target: 1
dep: "compound"
isExtra: false
sourceCopy: 0
targetCopy: 0
language: UniversalEnglish
}
edge {
source: 5
target: 2
dep: "nsubjpass"
isExtra: false
sourceCopy: 0
targetCopy: 0
language: UniversalEnglish
}
edge {
source: 5
target: 3
dep: "auxpass"
isExtra: false
sourceCopy: 0
targetCopy: 0
language: UniversalEnglish
}
edge {
source: 5
target: 4
dep: "neg"
isExtra: false
sourceCopy: 0
targetCopy: 0
language: UniversalEnglish
}
edge {
source: 5
target: 7
dep: "nmod"
isExtra: false
sourceCopy: 0
targetCopy: 0
language: UniversalEnglish
}
edge {
source: 5
target: 8
dep: "punct"
isExtra: false
sourceCopy: 0
targetCopy: 0
language: UniversalEnglish
}
edge {
source: 7
target: 6
dep: "case"
isExtra: false
sourceCopy: 0
targetCopy: 0
language: UniversalEnglish
}
root: 5
---
Tokens of first sentence
word: "Barack"
pos: "NNP"
value: "Barack"
before: ""
after: " "
originalText: "Barack"
beginChar: 0
endChar: 6
tokenBeginIndex: 0
tokenEndIndex: 1
hasXmlContext: false
isNewline: false
word: "Obama"
pos: "NNP"
value: "Obama"
before: " "
after: " "
originalText: "Obama"
beginChar: 7
endChar: 12
tokenBeginIndex: 1
tokenEndIndex: 2
hasXmlContext: false
isNewline: false
word: "was"
pos: "VBD"
value: "was"
before: " "
after: " "
originalText: "was"
beginChar: 13
endChar: 16
tokenBeginIndex: 2
tokenEndIndex: 3
hasXmlContext: false
isNewline: false
word: "not"
pos: "RB"
value: "not"
before: " "
after: " "
originalText: "not"
beginChar: 17
endChar: 20
tokenBeginIndex: 3
tokenEndIndex: 4
hasXmlContext: false
isNewline: false
word: "born"
pos: "VBN"
value: "born"
before: " "
after: " "
originalText: "born"
beginChar: 21
endChar: 25
tokenBeginIndex: 4
tokenEndIndex: 5
hasXmlContext: false
isNewline: false
word: "in"
pos: "IN"
value: "in"
before: " "
after: " "
originalText: "in"
beginChar: 26
endChar: 28
tokenBeginIndex: 5
tokenEndIndex: 6
hasXmlContext: false
isNewline: false
word: "Hawaii"
pos: "NNP"
value: "Hawaii"
before: " "
after: ""
originalText: "Hawaii"
beginChar: 29
endChar: 35
tokenBeginIndex: 6
tokenEndIndex: 7
hasXmlContext: false
isNewline: false
word: "."
pos: "."
value: "."
before: ""
after: ""
originalText: "."
beginChar: 35
endChar: 36
tokenBeginIndex: 7
tokenEndIndex: 8
hasXmlContext: false
isNewline: false
I will not go into details on this one, but if all else fails, you can also access the CoreNLP server via the NLTK library. It does output the negations, but requires a little more work to start the servers. Details are on this page.
I figured I could also share with you the code to get the DependencyGraph into a nice list of 'dependency, argument1, argument2' in a shape similar to what stanfordnlp outputs.
from stanfordnlp.server import CoreNLPClient
text = "Barack Obama was not born in Hawaii."
# set up the client
with CoreNLPClient(annotators=['tokenize','ssplit','pos','depparse'], timeout=60000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)
    # get the first sentence
    sentence = ann.sentence[0]
    # get the dependency parse of the first sentence
    dependency_parse = sentence.basicDependencies
    #print(dir(sentence.token[0])) #to find all the attributes and methods of a Token object
    #print(dir(dependency_parse)) #to find all the attributes and methods of a DependencyGraph object
    #print(dir(dependency_parse.edge))
    # get a dictionary associating each token/node with its label
    token_dict = {}
    for i in range(0, len(sentence.token)):
        token_dict[sentence.token[i].tokenEndIndex] = sentence.token[i].word
    # get a list of the dependencies with the words they connect
    list_dep = []
    for i in range(0, len(dependency_parse.edge)):
        source_node = dependency_parse.edge[i].source
        source_name = token_dict[source_node]
        target_node = dependency_parse.edge[i].target
        target_name = token_dict[target_node]
        dep = dependency_parse.edge[i].dep
        list_dep.append((dep,
                         str(source_node)+'-'+source_name,
                         str(target_node)+'-'+target_name))
    print(list_dep)
It outputs the following:
[('compound', '2-Obama', '1-Barack'), ('nsubjpass', '5-born', '2-Obama'), ('auxpass', '5-born', '3-was'), ('neg', '5-born', '4-not'), ('nmod', '5-born', '7-Hawaii'), ('punct', '5-born', '8-.'), ('case', '7-Hawaii', '6-in')]
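From a list in this shape, pulling out just the negations is a short filter. This is only a sketch on top of the output above: negation_edges is a hypothetical helper name, and it assumes the 'index-word' label format produced by the code in this answer.

```python
# The list printed above.
list_dep = [('compound', '2-Obama', '1-Barack'), ('nsubjpass', '5-born', '2-Obama'),
            ('auxpass', '5-born', '3-was'), ('neg', '5-born', '4-not'),
            ('nmod', '5-born', '7-Hawaii'), ('punct', '5-born', '8-.'),
            ('case', '7-Hawaii', '6-in')]

def negation_edges(deps):
    """Return (governor, negation) word pairs for every 'neg' edge."""
    pairs = []
    for dep, source, target in deps:
        if dep == "neg":
            # Labels look like '5-born'; split only once so that words
            # which themselves contain '-' are kept whole.
            pairs.append((source.split("-", 1)[1], target.split("-", 1)[1]))
    return pairs

print(negation_edges(list_dep))  # [('born', 'not')]
```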
Upvotes: 7
Reputation: 16620
I believe there is likely a discrepancy between the model that was used to generate the dependencies in the documentation and the one that is available online, hence the difference. I would raise the issue with the stanfordnlp library maintainers directly via GitHub issues.
Upvotes: 1