Reputation: 176
I'm looking for a Python data structure that handles the Penn Treebank structure. This is a sample of what the Treebank looks like:
( (S
(NP-SBJ (PRP He) )
(VP (VBD shouted) )
(. .) ))
Essentially, I would like a data structure that I can ask things like "What are the children of the subject NP?" or "What types of phrases dominate the pronoun?", preferably in Python. Does anyone have a clue?
Upvotes: 3
Views: 2039
Reputation: 363737
I still suggest using NLTK to read the treebank (see e.g. this blog post), but I can imagine it doesn't support this kind of general queries.
"What are the children of the subject NP?"
This would be a dict
, say children
, mapping nonterminals to sets
of either nonterminals or child nodes.
"What types of phrases dominate the pronoun?"
This would be another dict
, say parents
, mapping nonterminals to sets
of nonterminals.
You might want to build a relational database of tree nodes. The exact schema would depend on what kind of queries you want to ask, but be sure to check out the Python sqlite3
module.
Alternatively, you can recode the treebank in XML and use XPath to query it. LXML is the best XML/XPath library for Python, IMHO.
Upvotes: 1
Reputation: 38980
NLTK modules might be a good start to implement Penn Treebank and other NLP related stuff in Python.
Upvotes: 3