user3710832

Reputation: 415

How to extract nouns using NLTK pos_tag()?

I am fairly new to Python and cannot figure out the bug in my code. I want to extract nouns using NLTK.

I have written the following code:

import nltk

sentence = "At eight o'clock on Thursday film morning word line test best beautiful Ram Aaron design"

tokens = nltk.word_tokenize(sentence)

tagged = nltk.pos_tag(tokens)


length = len(tagged) - 1

a = list()

for i in (0,length):
    log = (tagged[i][1][0] == 'N')
    if log == True:
      a.append(tagged[i][0])

When I run this, 'a' only has one element

a
['design']

I do not understand why.

When I do it without the for loop, that is, by running

log = (tagged[i][1][0] == 'N')
if log == True:
    a.append(tagged[i][0])

and changing the value of 'i' manually from 0 to 'length', I get the correct output, but with the for loop only the last element is returned. Can someone tell me what is going wrong with the for loop?

'a' should be as follows after the code

['Thursday', 'film', 'morning', 'word', 'line', 'test', 'Ram', 'Aaron', 'design']

Upvotes: 4

Views: 13519

Answers (4)

Habibur Rahman

Reputation: 303

Try this:

import nltk

sentence = "At eight o'clock on Thursday film morning word line test best beautiful Ram Aaron design"

tokens = nltk.word_tokenize(sentence)

tagged = nltk.pos_tag(tokens)

a = list()

# range(len(tagged)) covers every index, including the last token
for i in range(len(tagged)):
    # Penn Treebank noun tags (NN, NNS, NNP, NNPS) all start with 'N'
    if tagged[i][1][0] == 'N':
        a.append(tagged[i][0])
print(a)

Upvotes: 0

alvas

Reputation: 122012

>>> from nltk import word_tokenize, pos_tag
>>> sentence = "At eight o'clock on Thursday film morning word line test best beautiful Ram Aaron design"
>>> nouns = [token for token, pos in pos_tag(word_tokenize(sentence)) if pos.startswith('N')]
>>> nouns
['Thursday', 'film', 'morning', 'word', 'line', 'test', 'Ram', 'Aaron', 'design']

Upvotes: 10

Cory Kramer

Reputation: 117856

This line will only loop twice

for i in (0,length):

Once with i = 0 and once with i = length

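What you want is

for i in range(len(tagged)):

Note that range stops one short of its argument. Since length is already len(tagged) - 1, range(length) would skip the last token.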
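
To see the difference without installing NLTK, here is a minimal sketch. The `tagged` list is hand-written with assumed tags and only stands in for real `pos_tag()` output.

```python
# Hand-written (token, tag) pairs standing in for nltk.pos_tag() output;
# the tags are assumed for illustration only.
tagged = [("At", "IN"), ("Thursday", "NNP"), ("film", "NN"), ("design", "NN")]
length = len(tagged) - 1

# A tuple is iterated member by member: only indices 0 and length are visited.
tuple_indices = list((0, length))

# range(len(tagged)) yields every valid index.
range_indices = list(range(len(tagged)))

print(tuple_indices)  # [0, 3]
print(range_indices)  # [0, 1, 2, 3]

# With the tuple, only the first and last tokens are ever checked.
nouns_from_tuple = [tagged[i][0] for i in tuple_indices if tagged[i][1][0] == "N"]
print(nouns_from_tuple)  # ['design']
```

This reproduces the symptom in the question: the first token is not a noun, the last one is, so the list ends up with a single element.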

Upvotes: 0

Kevin

Reputation: 76194

for i in (0,length):

This iterates over two elements, zero and length. If you want to iterate over every index from zero through length, use range. Keep in mind that range excludes its end value, so the end has to be length + 1.

for i in range(0, length + 1):

Better yet, it's more idiomatic to directly iterate over the elements of a sequence, rather than its index. This will reduce the likelihood of typos like the one above.

for item in tagged:
    if item[1][0] == 'N':
        a.append(item[0])

Size-conscious users may even prefer the one-line list comprehension:

a = [item[0] for item in tagged if item[1][0] == 'N']
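
A possible variation on the comprehension above unpacks each pair into two names, which avoids the numeric indexing. This is only a sketch: the `tagged` list below is hand-written with assumed tags rather than produced by `pos_tag()`, and the names `token`, `tag`, and `nouns` are illustrative.

```python
# Hand-written (token, tag) pairs standing in for nltk.pos_tag() output;
# the tags are assumed for illustration only.
tagged = [("At", "IN"), ("eight", "CD"), ("Thursday", "NNP"),
          ("film", "NN"), ("best", "JJS"), ("design", "NN")]

# Unpack each pair and keep the tokens whose tag starts with 'N'.
nouns = [token for token, tag in tagged if tag.startswith("N")]
print(nouns)  # ['Thursday', 'film', 'design']
```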

Upvotes: 8
