Reputation: 1395
Using the code below, I am chunking an already tagged and tokenized RSS feed. The "print subtree.leaves()" call is outputting:
[('Prime', 'NNP'), ('Minister', 'NNP'), ('Stephen', 'NNP'), ('Harper', 'NNP')] [('U.S.', 'NNP'), ('President', 'NNP'), ('Barack', 'NNP'), ('Obama', 'NNP')] [('what\', 'NNP')] [('Keystone', 'NNP'), ('XL', 'NNP')] [('CBC', 'NNP'), ('News', 'NNP')]
This looks like a Python list, but I do not know how to access it directly or iterate over it. I think it is a subtree output.
I want to be able to turn this subtree into a list that I can manipulate. Is there an easy way to do this? This is the first time I have encountered trees in Python and I am lost. I want to end up with this list:
docs = ["Prime Minister Stephen Harper", "U.S. President Barack Obama", "what\", "Keystone XL", "CBC News"]
Is there a simple way to make this happen?
Thanks, as always for the help!
grammar = r""" Proper: {<NNP>+} """
cp = nltk.RegexpParser(grammar)
result = cp.parse(posDocuments)
nounPhraseDocs.append(result)
for subtree in result.subtrees(filter=lambda t: t.node == 'Proper'):
    # print the noun phrase as a list of part-of-speech tagged words
    print subtree.leaves()
    print " "
Upvotes: 1
Views: 2147
Reputation: 526
node has been replaced by label() now. So, modifying Viktor's answer:
docs = []
for subtree in result.subtrees(filter=lambda t: t.label() == 'Proper'):
    docs.append(" ".join([a for (a,b) in subtree.leaves()]))
This will give you a list with one string per Proper chunk, built only from the tokens that are part of that chunk. If you remove the filter argument from the subtrees() method, you'll get one string for every subtree instead, including the root tree itself, each containing all the tokens under that parent.
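For anyone who wants to try this without the RSS feed, here is a minimal self-contained sketch, assuming Python 3 and NLTK 3. The `tagged` sentence is a made-up stand-in for the asker's posDocuments, not their real data.

```python
import nltk

# Hypothetical tagged input standing in for the asker's posDocuments.
tagged = [
    ("Prime", "NNP"), ("Minister", "NNP"), ("Stephen", "NNP"), ("Harper", "NNP"),
    ("met", "VBD"),
    ("U.S.", "NNP"), ("President", "NNP"), ("Barack", "NNP"), ("Obama", "NNP"),
    ("about", "IN"),
    ("Keystone", "NNP"), ("XL", "NNP"),
]

# Chunk every run of one or more NNP tags into a "Proper" subtree.
cp = nltk.RegexpParser(r"Proper: {<NNP>+}")
result = cp.parse(tagged)

# leaves() returns an ordinary list of (word, tag) tuples,
# so the words can be joined into one string per chunk.
docs = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in result.subtrees(filter=lambda t: t.label() == "Proper")
]
print(docs)

# Without the filter, subtrees() also yields the root tree,
# so the first entry is the whole sentence.
all_docs = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in result.subtrees()
]
print(len(all_docs))
```

Since docs is a plain list of strings, it can be indexed, sliced, or iterated like any other Python list.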
Upvotes: 5
Reputation: 1371
docs = []
for subtree in result.subtrees(filter=lambda t: t.node == 'Proper'):
    docs.append(" ".join([a for (a,b) in subtree.leaves()]))
print docs
This should do the trick.
Upvotes: 1