Reputation: 49
Hi, I'm trying to extract proper nouns from a tagged corpus. For example, from the NLTK tagged Brown corpus, I'm trying to extract only the words tagged "NP".
My code:
import nltk
from nltk.corpus import brown
f = brown.raw('ca01')
print nltk.corpus.brown.tagged_words()
w=[nltk.tag.str2tuple(t) for t in f.split() if t.split('/')[1] == 'NP']
print w
But it does not show the words; instead it shows only
[]
Sample output:
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
[]
Why is that?
Thanks.
If I only print f.split(), then I get:
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.'), ('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL').....
Upvotes: 0
Views: 1172
Reputation: 8246
I can't really tell from what you've given us, but have you tried working through the problem step by step? It seems that t.split('/')[1] == 'NP' never evaluates to True. So you should start by checking what f.split() actually contains. You could also try the condition if t.split('/')[1].startswith('NP'), but I can't really tell.
EDIT:
OK, first: if that is really what f.split() prints for you, then you should get an exception, since t is a tuple and a tuple doesn't have a split() method. So you made me curious, and I installed nltk, downloaded the 'brown' corpus, and tried your code. When I run:
import nltk
from nltk.corpus import brown
f = brown.raw('ca01')
print f.split()
['The/at', 'Fulton/np-tl', 'County/nn-tl', 'Grand/jj-tl', 'Jury/nn-tl', 'said/vbd', 'Friday/nr', 'an/at', 'investigation/nn', 'of/in', "Atlanta's/np$", 'recent/jj', 'primary/nn', 'election/nn', 'produced/vbd', '``/``', 'no/at', 'evidence/nn', "''/''", 'that/cs', 'any/dti', 'irregularities/nns', 'took/vbd', 'place/nn', './.', 'The/at', 'jury/nn', 'further/rbr', 'said/vbd', 'in/in', 'term-end/nn', 'presentments/nns', 'that/cs', 'the/at', 'City/nn-tl', 'Executive/jj-tl', 'Committee/nn-tl', ',/,', 'which/wdt', 'had/hvd', 'over-all/jj', 'charge/nn', 'of/in', 'the/at', 'election/nn', ',/,', '``/``', 'deserves/vbz', 'the/at', 'praise/nn', 'and/cc', 'thanks/nns', 'of/in', 'the/at', 'City/nn-tl' .....]
So I have no idea what you did to get that result, but it was incorrect. As you can see from the output, the tag (the second part of each token) is in lowercase; that is why your code failed. So if you do:
w=[nltk.tag.str2tuple(t) for t in f.split() if t.split('/')[1].lower() == 'np']
This will get you the result (the tags come out in uppercase because str2tuple upper-cases them):
[('September-October', 'NP'), ('Durwood', 'NP'), ('Pye', 'NP'), ('Ivan', 'NP'), ('Allen', 'NP'), ('Jr.', 'NP'), ('Fulton', 'NP'), ('Atlanta', 'NP'), ('Fulton', 'NP'), ('Fulton', 'NP'), ('Jan.', 'NP'), ('Fulton', 'NP'), ('Bellwood', 'NP'), ('Alpharetta', 'NP'), ('William', 'NP'), ('B.', 'NP'), ('Hartsfield', 'NP'), ('Pearl', 'NP'), ('Williams', 'NP'), ('Hartsfield', 'NP'), ('Aug.', 'NP'), ('William', 'NP'), ('Berry', 'NP'), ('Jr.', 'NP'), ('Mrs.', 'NP'), ('J.', 'NP'), ('M.', 'NP'), ('Cheshire', 'NP'), ('Griffin', 'NP'), ('Opelika', 'NP'), ('Ala.', 'NP'), ('Hartsfield', 'NP'), ('E.', 'NP'), ('Pelham', 'NP'), ('Henry', 'NP'), ('L.', 'NP'), ('Bowden', 'NP'), ('Hartsfield', 'NP'), ('Atlanta', 'NP'), ('Jan.', 'NP'), ('Ivan', 'NP'), ....]
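To see the case mismatch without downloading the corpus, here is a minimal sketch in plain Python. The sample string is a hypothetical stand-in for what brown.raw('ca01') returns (tokens in word/tag form with lowercase tags), and split_token and words_with_tag are helper names made up for this illustration, not part of NLTK.

```python
# Hypothetical sample in the same word/tag layout as brown.raw('ca01').
raw_sample = ("The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl "
              "said/vbd Friday/nr Atlanta's/np$ Durwood/np Pye/np")

# The original comparison: the raw tags are lowercase, so nothing matches.
broken = [t for t in raw_sample.split() if t.split('/')[1] == 'NP']


def split_token(token):
    """Split 'word/tag' on the last slash and upper-case the tag."""
    word, _, tag = token.rpartition('/')
    return word, tag.upper()


def words_with_tag(raw, wanted='NP', prefix=False):
    """Return (word, TAG) pairs whose tag equals `wanted`, ignoring case.

    With prefix=True, tags that merely start with `wanted`
    (for example NP-TL or NP$) are kept as well.
    """
    wanted = wanted.upper()
    result = []
    for token in raw.split():
        word, tag = split_token(token)
        if tag == wanted or (prefix and tag.startswith(wanted)):
            result.append((word, tag))
    return result


exact = words_with_tag(raw_sample)               # only plain NP
loose = words_with_tag(raw_sample, prefix=True)  # NP, NP-TL, NP$, ...

print(broken)
print(exact)
print(loose)
```

Normalising the case once, in one place, keeps the comparison from depending on how the file happens to store its tags. Whether you want the exact match or the prefix match depends on whether tags such as NP-TL and NP$ count as proper nouns for your purposes.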
For future reference, double-check information like what I asked for before you post it: if it's not correct, it's misleading, and it helps neither the people trying to help you nor yourself. This isn't meant as criticism, just constructive advice :)
Upvotes: 4
Reputation: 49886
One imagines that t.split('/')[1] == 'NP' is always evaluating to false.
Upvotes: 0