extracting from a tagged corpus in python

Question

hi i'm trying exract proper noun from a tagged corpus, lets say for example- from the nltk tagged corpus brown i'm trying to extract the words only tagged with "NP".

my code:

  import nltk
  from nltk.corpus import brown
  f = brown.raw('ca01')
  print nltk.corpus.brown.tagged_words()
  w=[nltk.tag.str2tuple(t) for t in f.split()]
  print w

but it is not showing the words istead it is showing only

[]

sample output:

    [('The', 'AT'), ('Fulton', 'NP-TL'), ...]
    []

why is it??

thanks.

I i only prit f.split()..then i get

             [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.'), ('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL').....

Bogdan · Accepted Answer

Can't really tell from what you've given us, but have you tried going into the problem step by step? It seems that under no circumstances does t.split('/')[1] == 'NP' evaluate to True. So you should start by:

print/debug to see what exactly does f.split() contain
make sure your condition is actually the correct one, from the little sample of output you gave there it looks to me you are looking more for: if t.split('/')[1].startswith('NP') but can't really tell.

EDIT:

Ok, first if that is really what f.split() prints to you then you should get an exception sicne t is a tuple and a tuple doesnt have a split() method. So you made me curious and I installed nltk and downloaded the 'brown' corpus and tried your code. Now first, to me if I do:

  import nltk
  from nltk.corpus import brown
  f = brown.raw('ca01')
  print f.split()

  ['The/at', 'Fulton/np-tl', 'County/nn-tl', 'Grand/jj-tl', 'Jury/nn-tl', 'said/vbd', 'Friday/nr', 'an/at', 'investigation/nn', 'of/in', "Atlanta's/np$", 'recent/jj', 'primary/nn', 'election/nn', 'produced/vbd', '``/``', 'no/at', 'evidence/nn', "''/''", 'that/cs', 'any/dti', 'irregularities/nns', 'took/vbd', 'place/nn', './.', 'The/at', 'jury/nn', 'further/rbr', 'said/vbd', 'in/in', 'term-end/nn', 'presentments/nns', 'that/cs', 'the/at', 'City/nn-tl', 'Executive/jj-tl', 'Committee/nn-tl', ',/,', 'which/wdt', 'had/hvd', 'over-all/jj', 'charge/nn', 'of/in', 'the/at', 'election/nn', ',/,', '``/``', 'deserves/vbz', 'the/at', 'praise/nn', 'and/cc', 'thanks/nns', 'of/in', 'the/at', 'City/nn-tl' .....]

So I have no ideea what you did there to get the result but it was incorrect. Now as you can see from the groups, the second part of the word is in lowercase, that is why your code failed. So if you do:

w=[nltk.tag.str2tuple(t) for t in f.split() if t.split('/')[1].lower() == 'np']

This will get you the result:

[('September-October', 'NP'), ('Durwood', 'NP'), ('Pye', 'NP'), ('Ivan', 'NP'), ('Allen', 'NP'), ('Jr.', 'NP'), ('Fulton', 'NP'), ('Atlanta', 'NP'), ('Fulton', 'NP'), ('Fulton', 'NP'), ('Jan.', 'NP'), ('Fulton', 'NP'), ('Bellwood', 'NP'), ('Alpharetta', 'NP'), ('William', 'NP'), ('B.', 'NP'), ('Hartsfield', 'NP'), ('Pearl', 'NP'), ('Williams', 'NP'), ('Hartsfield', 'NP'), ('Aug.', 'NP'), ('William', 'NP'), ('Berry', 'NP'), ('Jr.', 'NP'), ('Mrs.', 'NP'), ('J.', 'NP'), ('M.', 'NP'), ('Cheshire', 'NP'), ('Griffin', 'NP'), ('Opelika', 'NP'), ('Ala.', 'NP'), ('Hartsfield', 'NP'), ('E.', 'NP'), ('Pelham', 'NP'), ('Henry', 'NP'), ('L.', 'NP'), ('Bowden', 'NP'), ('Hartsfield', 'NP'), ('Atlanta', 'NP'), ('Jan.', 'NP'), ('Ivan', 'NP'), ....]

Now for future reference double check before you post information like the one I asked for, just because if it's not correct then it's missleading and it won't help neither the ones who try to help you, nor yourself. Not as a critic but as constructive advice :)

extracting from a tagged corpus in python

Answers (2)

Related Questions