Pythonic

Reputation: 2131

NLTK and Stanford Dependency Parser - How to get word position?

Can I get the word's position when using Stanford's dependency parser via NLTK as shown in this SO question?

Example:

When I use Stanford's Dependency Parser via NLTK, following the example in the SO post referenced above, I get a list of tuples like this:

[((u'shot', u'VBD'), u'nsubj', (u'I', u'PRP')),
((u'shot', u'VBD'), u'dobj', (u'elephant', u'NN')),
((u'elephant', u'NN'), u'det', (u'an', u'DT')),
((u'shot', u'VBD'), u'prep', (u'in', u'IN')),
((u'in', u'IN'), u'pobj', (u'sleep', u'NN')),
((u'sleep', u'NN'), u'poss', (u'my', u'PRP$'))]

whereas when I use the online tool I also get a pointer to the word's position, see the digits in the text below:

nsubj(shot-2, I-1)
det(elephant-4, an-3)
dobj(shot-2, elephant-4)
prep(shot-2, in-5)
poss(sleep-7, my-6)
pobj(in-5, sleep-7)

The info about the word position is non-trivial in some specific cases*, so is it still possible to get it via NLTK?

(*) for the specific cases: think of technical texts where jargon acronyms are substituted with plain-English keywords to ease the parser's job

Upvotes: 2

Views: 529

Answers (1)

Igor

Reputation: 1281

I'm not sure there is a way to get this from the triples directly. But, if I recall correctly, you call deps.triples() on your dependencies object to get them in this triple format. On that same dependencies object (deps above), you can also call deps.get_by_address(i) to get the word at the specified index. You could check whether these are connected (i.e. whether the object you get from .get_by_address(position) matches the items in deps.triples()). If so, you can build a dictionary beforehand mapping each dep triple to its position. Note that .get_by_address() is 1-based (not 0-based), as address 0 is always the root node.
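That dictionary idea can be sketched as follows. Note this is a hypothetical, self-contained example: instead of calling the Stanford parser, it builds the same graph for the question's sentence from a hand-written CoNLL string via nltk.parse.DependencyGraph, then maps each (word, tag) pair to its 1-based address:

```python
from collections import defaultdict
from nltk.parse import DependencyGraph

# Hand-written CoNLL columns (word, tag, head, rel) for the question's sentence;
# in practice `deps` would come from the Stanford parser via NLTK instead.
conll = """I\tPRP\t2\tnsubj
shot\tVBD\t0\troot
an\tDT\t4\tdet
elephant\tNN\t2\tdobj
in\tIN\t2\tprep
my\tPRP$\t7\tposs
sleep\tNN\t5\tpobj
"""
deps = DependencyGraph(conll, top_relation_label='root')

# Map (word, tag) -> list of addresses (a word may occur more than once).
positions = defaultdict(list)
for addr, node in deps.nodes.items():
    if node['word'] is not None:  # skip the artificial root node at address 0
        positions[(node['word'], node['tag'])].append(addr)

# Each (word, tag) pair in the triples can now be looked up by position.
for governor, rel, dependent in deps.triples():
    print(governor, positions[governor], rel, dependent, positions[dependent])
```

The list value in `positions` covers the case where the same word/tag combination occurs more than once in a sentence; for unambiguous sentences each list has a single entry.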

EDIT: Just found out that .triples() really does just return a list of tuples; it doesn't look like anything fancy from which you could retrieve e.g. position info. The following may help you though (sorry for the German example):

s = 'Ich werde nach Hause gehen .'
# depParser is a StanfordDependencyParser instance, set up as in the linked question
res = depParser.parse(s.split())  # a plain .split() is enough since the input is already tokenised
deps = next(res)  # .parse() returns an iterator of DependencyGraph objects
traverse(deps, 0)  # 0 is always the address of the root node

traversing then goes as follows:

def traverse(deps, addr):
    dep = deps.get_by_address(addr)
    print(dep)
    for d in dep['deps']:
        for addr2 in dep['deps'][d]:
            traverse(deps, addr2)

This recursively walks through all dependencies in the graph, and gives me the following output:

{'word': None, 'head': None, 'address': 0, 'lemma': None, 'feats': None, 'ctag': 'TOP', 'deps': defaultdict(<class 'list'>, {'root': [3]}), 'tag': 'TOP', 'rel': None}
{'word': 'nach', 'head': 0, 'address': 3, 'lemma': '_', 'rel': 'root', 'ctag': 'VBP', 'feats': '_', 'deps': defaultdict(<class 'list'>, {'dobj': [5], 'nsubj': [2]}), 'tag': 'VBP'}
{'word': 'gehen', 'head': 3, 'address': 5, 'lemma': '_', 'rel': 'dobj', 'ctag': 'NN', 'feats': '_', 'deps': defaultdict(<class 'list'>, {'amod': [4]}), 'tag': 'NN'}
{'word': 'Hause', 'head': 5, 'address': 4, 'lemma': '_', 'rel': 'amod', 'ctag': 'JJ', 'feats': '_', 'deps': defaultdict(<class 'list'>, {}), 'tag': 'JJ'}
{'word': 'werde', 'head': 3, 'address': 2, 'lemma': '_', 'rel': 'nsubj', 'ctag': 'NNP', 'feats': '_', 'deps': defaultdict(<class 'list'>, {'compound': [1]}), 'tag': 'NNP'}
{'word': 'Ich', 'head': 2, 'address': 1, 'lemma': '_', 'rel': 'compound', 'ctag': 'NNP', 'feats': '_', 'deps': defaultdict(<class 'list'>, {}), 'tag': 'NNP'}

This is in a slightly different format than the .triples() you are using, but I hope it helps.
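If the goal is output in exactly the word-indexed format of the online tool, .triples() can be bypassed altogether: every node in the graph already stores its own address and its head's address. Again a hypothetical, self-contained sketch, using a hand-written CoNLL string for the question's sentence in place of the parser's output:

```python
from nltk.parse import DependencyGraph

# Same hand-written CoNLL columns (word, tag, head, rel) as would come out of the parser.
conll = """I\tPRP\t2\tnsubj
shot\tVBD\t0\troot
an\tDT\t4\tdet
elephant\tNN\t2\tdobj
in\tIN\t2\tprep
my\tPRP$\t7\tposs
sleep\tNN\t5\tpobj
"""
deps = DependencyGraph(conll, top_relation_label='root')

# For every non-root node, print rel(head-index, dependent-index),
# mirroring the online demo's "nsubj(shot-2, I-1)" style.
lines = []
for addr in sorted(a for a in deps.nodes if a != 0):
    node = deps.nodes[addr]
    head = deps.get_by_address(node['head'])
    head_word = head['word'] if head['word'] is not None else 'ROOT'
    lines.append('%s(%s-%d, %s-%d)' % (node['rel'], head_word, node['head'], node['word'], addr))
print('\n'.join(lines))
```

Sorting by address keeps the output in sentence order; the root edge shows up as root(ROOT-0, shot-2), which the online demo usually omits.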

Upvotes: 1
