puncrazy

Reputation: 349

nltk NER word extraction

I have checked previous related threads, but they did not solve my issue. I have written code to extract named entities (NER) from text.

import nltk

text = "Stallone jason's film Rocky was inducted into the National Film Registry as well as having its film props placed in the Smithsonian Museum."

tokenized = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokenized)
namedEnt = nltk.ne_chunk(tagged, binary=True)   # binary=True: every entity is labelled NE
print(namedEnt)
namedEnt = nltk.ne_chunk(tagged, binary=False)  # binary=False: typed labels such as PERSON

which gives this sort of result:

(S
  (NE Stallone/NNP)
  jason/NN
  's/POS
  film/NN
  (NE Rocky/NNP)
  was/VBD
  inducted/VBN
  into/IN
  the/DT
  (NE National/NNP Film/NNP Registry/NNP)
  as/IN
  well/RB
  as/IN
  having/VBG
  its/PRP$
  film/NN
  props/NNS
  placed/VBN
  in/IN
  the/DT
  (NE Smithsonian/NNP Museum/NNP)
  ./.)

whereas I expect only the named entities as a result, like:

Stallone
Rocky
National Film Registry
Smithsonian Museum

How can I achieve this?

UPDATE

result = ' '.join([y[0] for y in x.leaves()]) for x in namedEnt.subtrees() if x.node == "NE"
print result

This gives a syntax error. What is the correct way to write it?

UPDATE2

text = "Stallone jason's film Rocky was inducted into the National Film Registry as well as having its film props placed in the Smithsonian Museum."

tokenized = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokenized)
namedEnt = nltk.ne_chunk(tagged, binary = True)
print namedEnt
np = [' '.join([y[0] for y in x.leaves()]) for x in namedEnt.subtrees() if x.node == "NE"]
print np

error:

 np = [' '.join([y[0] for y in x.leaves()]) for x in namedEnt.subtrees() if x.node == "NE"]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tree.py", line 198, in _get_node
    raise NotImplementedError("Use label() to access a node label.")
NotImplementedError: Use label() to access a node label.

So I tried:

np = [' '.join([y[0] for y in x.leaves()]) for x in namedEnt.subtrees() if x.label() == "NE"]

which gives an empty result.

Upvotes: 1

Views: 1725

Answers (1)

Phani

Reputation: 3325

The namedEnt returned is actually a Tree object, which is a subclass of list. You can do the following to parse it:

[' '.join([y[0] for y in x.leaves()]) for x in namedEnt.subtrees() if x.node == "NE"]

Output:

['Stallone', 'Rocky', 'National Film Registry', 'Smithsonian Museum']
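On NLTK 3 and later, the node attribute is no longer available and label() is used instead, as the traceback in the question indicates. Here is a minimal sketch of the same extraction written that way. To keep it runnable without the tokenizer and chunker models, it uses a hand-built Tree as a stand-in for the output of nltk.ne_chunk(tagged, binary=True), shortened from the tree printed in the question. The names named_ent and extract_entities are made up for this example.

```python
from nltk import Tree

# Hand-built stand-in for nltk.ne_chunk(tagged, binary=True),
# shortened from the tree printed in the question.
named_ent = Tree('S', [
    Tree('NE', [('Stallone', 'NNP')]),
    ('jason', 'NN'),
    ("'s", 'POS'),
    ('film', 'NN'),
    Tree('NE', [('Rocky', 'NNP')]),
    ('was', 'VBD'),
    ('inducted', 'VBN'),
    ('into', 'IN'),
    ('the', 'DT'),
    Tree('NE', [('National', 'NNP'), ('Film', 'NNP'), ('Registry', 'NNP')]),
    ('.', '.'),
])


def extract_entities(tree, label='NE'):
    """Join the words of every subtree whose label equals `label`."""
    return [' '.join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() == label]


entities = extract_entities(named_ent)
print(entities)  # ['Stallone', 'Rocky', 'National Film Registry']
```

With a real pipeline, the tree passed in would be the result of ne_chunk rather than the hand-built one.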

Setting the binary flag to True only indicates whether a subtree is a named entity (NE) or not, which is all we need above. When it is set to False, the chunker gives more information, such as whether the NE is an Organization, a Person, etc. For some reason, the results with the flag on and off don't seem to agree with one another.
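To illustrate the difference the flag makes: with binary=False no subtree is labelled "NE", so a filter on "NE" returns an empty list against such a tree. A rough sketch of collecting the typed entities instead is below. The tree is hand-built, and the labels in it are assumed for illustration, not real chunker output for the sentence in the question. The names typed_ent and extract_typed_entities are made up for this example.

```python
from nltk import Tree

# Illustrative stand-in for nltk.ne_chunk(tagged, binary=False).
# The entity labels here are assumed for the example, not real chunker output.
typed_ent = Tree('S', [
    Tree('PERSON', [('Stallone', 'NNP')]),
    ('was', 'VBD'),
    ('honoured', 'VBN'),
    ('by', 'IN'),
    ('the', 'DT'),
    Tree('ORGANIZATION', [('National', 'NNP'), ('Film', 'NNP'), ('Registry', 'NNP')]),
    ('.', '.'),
])


def extract_typed_entities(tree):
    """Return (label, text) pairs for every subtree except the root."""
    return [(subtree.label(), ' '.join(word for word, tag in subtree.leaves()))
            for subtree in tree.subtrees()
            if subtree is not tree]  # subtrees() also yields the root itself


typed = extract_typed_entities(typed_ent)
print(typed)

# Filtering this tree on "NE" finds nothing, because no subtree carries that label.
ne_only = [st for st in typed_ent.subtrees() if st.label() == 'NE']
print(ne_only)  # []
```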

Upvotes: 3
