Aristotle Tan Yi Sing

Reputation: 105

Stanford Universal Dependencies on Python NLTK

Is there any way I can get the Universal Dependencies using Python or NLTK? I can only produce the parse tree.

Example:

Input sentence:

My dog also likes eating sausage.

Output:

Universal dependencies

nmod:poss(dog-2, My-1)
nsubj(likes-4, dog-2)
advmod(likes-4, also-3)
root(ROOT-0, likes-4)
xcomp(likes-4, eating-5)
dobj(eating-5, sausage-6)

Upvotes: 3

Views: 3336

Answers (1)

dimazest

Reputation: 56

Wordseer's stanford-corenlp-python fork is a good start, as it works with the recent CoreNLP release (3.5.2). However, it gives you raw output that you need to transform manually. For example, assuming you have the wrapper running:

>>> import json, jsonrpclib
>>> from pprint import pprint
>>>
>>> server = jsonrpclib.Server("http://localhost:8080")
>>>
>>> pprint(json.loads(server.parse('John loves Mary.')))  # doctest: +SKIP
{u'sentences': [{u'dependencies': [[u'root', u'ROOT', u'0', u'loves', u'2'],
                                   [u'nsubj',
                                    u'loves',
                                    u'2',
                                    u'John',
                                    u'1'],
                                   [u'dobj', u'loves', u'2', u'Mary', u'3'],
                                   [u'punct', u'loves', u'2', u'.', u'4']],
                 u'parsetree': [],
                 u'text': u'John loves Mary.',
                 u'words': [[u'John',
                             {u'CharacterOffsetBegin': u'0',
                              u'CharacterOffsetEnd': u'4',
                              u'Lemma': u'John',
                              u'PartOfSpeech': u'NNP'}],
                            [u'loves',
                             {u'CharacterOffsetBegin': u'5',
                              u'CharacterOffsetEnd': u'10',
                              u'Lemma': u'love',
                              u'PartOfSpeech': u'VBZ'}],
                            [u'Mary',
                             {u'CharacterOffsetBegin': u'11',
                              u'CharacterOffsetEnd': u'15',
                              u'Lemma': u'Mary',
                              u'PartOfSpeech': u'NNP'}],
                            [u'.',
                             {u'CharacterOffsetBegin': u'15',
                              u'CharacterOffsetEnd': u'16',
                              u'Lemma': u'.',
                              u'PartOfSpeech': u'.'}]]}]}
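
If all you need is the `rel(head-i, dep-j)` notation from the question, the raw `dependencies` list is easy to reshape: each entry is `[rel, head, head_index, word, word_index]`, as in the output above. A minimal sketch continuing the session (the helper name `format_dependencies` is made up for illustration):

>>> def format_dependencies(sentence):
...     # Each raw entry is [rel, head_word, head_index, dep_word, dep_index].
...     for rel, head, head_n, word, n in sentence['dependencies']:
...         yield '%s(%s-%s, %s-%s)' % (rel, head, head_n, word, n)
...
>>> result = json.loads(server.parse('John loves Mary.'))
>>> for line in format_dependencies(result['sentences'][0]):
...     print(line)
root(ROOT-0, loves-2)
nsubj(loves-2, John-1)
dobj(loves-2, Mary-3)
punct(loves-2, .-4)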

In case you want to use a dependency parser, you can reuse NLTK's DependencyGraph with a bit of effort:

>>> import jsonrpclib, json
>>> from nltk.parse import DependencyGraph
>>>
>>> server = jsonrpclib.Server("http://localhost:8080")
>>> parses = json.loads(
...    server.parse(
...       'John loves Mary. '
...       'I saw a man with a telescope. '
...       'Ballmer has been vocal in the past warning that Linux is a threat to Microsoft.'
...    )
... )['sentences']
>>>
>>> def transform(sentence):
...     for rel, _, head, word, n in sentence['dependencies']:
...         n = int(n)
...
...         word_info = sentence['words'][n - 1][1]
...         tag = word_info['PartOfSpeech']
...         lemma = word_info['Lemma']
...         if rel == 'root':
...             # NLTK expects that the root relation is labelled as ROOT!
...             rel = 'ROOT'
...
...         # Hack: Return values we don't know as '_'.
...         #       Also, consider tag and ctag to be equal.
...         # n is used to sort words as they appear in the sentence.
...         yield n, '_', word, lemma, tag, tag, '_', head, rel, '_', '_'
...
>>> dgs = [
...     DependencyGraph(
...         ' '.join(items)  # NLTK expects an iterable of strings...
...         for n, *items in sorted(transform(parse))
...     )
...     for parse in parses
... ]
>>>
>>> # Play around with the information we've got.
>>>
>>> pprint(list(dgs[0].triples()))
[(('loves', 'VBZ'), 'nsubj', ('John', 'NNP')),
 (('loves', 'VBZ'), 'dobj', ('Mary', 'NNP')),
 (('loves', 'VBZ'), 'punct', ('.', '.'))]
>>>
>>> print(dgs[1].tree())
(saw I (man a (with (telescope a))) .)
>>>
>>> print(dgs[2].to_conll(4))  # doctest: +NORMALIZE_WHITESPACE
Ballmer     NNP     4       nsubj
has         VBZ     4       aux
been        VBN     4       cop
vocal       JJ      0       ROOT
in          IN      4       prep
the         DT      8       det
past        JJ      8       amod
warning     NN      5       pobj
that        WDT     13      dobj
Linux       NNP     13      nsubj
is          VBZ     13      cop
a           DT      13      det
threat      NN      8       rcmod
to          TO      13      prep
Microsoft   NNP     14      pobj
.           .       4       punct
<BLANKLINE>
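
As a possible follow-up (not part of the original answer), the graphs can be written out in 10-column CoNLL format so they can be reloaded later without querying the server again:

>>> with open('parses.conll', 'w') as f:  # 'parses.conll' is a made-up name
...     for dg in dgs:
...         f.write(dg.to_conll(10))  # one 10-column line per word
...         f.write('\n')             # a blank line separates the graphs

They can then be read back with DependencyGraph.load('parses.conll').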

Setting up CoreNLP is not that hard; see http://www.eecs.qmul.ac.uk/~dm303/stanford-dependency-parser-nltk-and-anaconda.html for more details.
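
Once the server is up, a quick smoke test confirms the wrapper is reachable (this just reuses the calls shown above and assumes it listens on localhost:8080 as in the examples):

>>> import json, jsonrpclib
>>> server = jsonrpclib.Server("http://localhost:8080")
>>> 'sentences' in json.loads(server.parse('Hello.'))
True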

Upvotes: 4
