Joko
Joko

Reputation: 176

Python Data Structure for Treebank?

I'm looking for a Python data structure that handles the Penn Treebank structure. This is a sample of what the Treebank looks like:

( (S
    (NP-SBJ (PRP He) )
    (VP (VBD shouted) )
    (. .) ))

Essentially, I would like a data structure that I can ask things like "What are the children of the subject NP?" or "What types of phrases dominate the pronoun?", preferably in Python. Does anyone have a clue?

Upvotes: 3

Views: 2039

Answers (2)

Fred Foo
Fred Foo

Reputation: 363737

I still suggest using NLTK to read the treebank (see e.g. this blog post), but I can imagine it doesn't support this kind of general queries.

"What are the children of the subject NP?"

This would be a dict, say children, mapping nonterminals to sets of either nonterminals or child nodes.

"What types of phrases dominate the pronoun?"

This would be another dict, say parents, mapping nonterminals to sets of nonterminals.

You might want to build a relational database of tree nodes. The exact schema would depend on what kind of queries you want to ask, but be sure to check out the Python sqlite3 module.

Alternatively, you can recode the treebank in XML and use XPath to query it. LXML is the best XML/XPath library for Python, IMHO.

Upvotes: 1

Praveen Gollakota
Praveen Gollakota

Reputation: 38980

NLTK modules might be a good start to implement Penn Treebank and other NLP related stuff in Python.

Upvotes: 3

Related Questions