Sebastian Zeki

Reputation: 6874

POS tagging base form of noun

I would like to search a sentence based on the presence of a noun of any form. Processing becomes an issue when I'm looking through very large texts and searching for all the different permutations of POS tags for a noun, namely NN,NNS,NNPS,NNS (and probably others). My question is therefore whether NN is the base form of all the other noun variants, and whether I can just search for NN. The same applies to adverbs, pronouns and verbs. Do POS tags have base forms?

Upvotes: 0

Views: 394

Answers (3)

duhaime

Reputation: 27594

You can also simplify the Penn Treebank part-of-speech tags that NLTK's pos_tag returns by mapping them to the coarser universal tagset:

import nltk
from nltk import pos_tag
from nltk.tag import map_tag

# Requires the NLTK tokenizer, tagger and universal_tagset data to be installed.
text = nltk.word_tokenize("And now for something completely different")
posTagged = pos_tag(text)

print(posTagged)

simplifiedTags = [(word, map_tag('en-ptb', 'universal', tag)) for word, tag in posTagged]
print(simplifiedTags)

The second print yields:

[('And', 'CONJ'), ('now', 'ADV'), ('for', 'ADP'), ('something', 'NOUN'), ('completely', 'ADV'), ('different', 'ADJ')]

Via this question on SO.
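Once the tags are simplified, checking a sentence for "a noun of any form" becomes a single comparison against NOUN. Here is a minimal sketch that reuses the tagged output shown above as hard-coded data, so it runs without NLTK; the helper name has_tag is made up for illustration and is not part of NLTK.

```python
# Tagged output copied from the answer above (universal tagset).
simplified_tags = [
    ("And", "CONJ"), ("now", "ADV"), ("for", "ADP"),
    ("something", "NOUN"), ("completely", "ADV"), ("different", "ADJ"),
]


def has_tag(tagged_sentence, coarse_tag):
    """Return True if any token in the sentence carries the given coarse tag.

    has_tag is a hypothetical helper written for this example.
    """
    return any(tag == coarse_tag for _, tag in tagged_sentence)


print(has_tag(simplified_tags, "NOUN"))  # True: 'something' is tagged NOUN
print(has_tag(simplified_tags, "VERB"))  # False: no verb in this sentence
```

The same call works for the other categories in the question by passing "ADV", "PRON" or "VERB" instead of "NOUN".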

Upvotes: 1

alvas

Reputation: 122052

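The Penn Treebank (PTB) tagset used by NLTK's pos_tag is somewhat hierarchical for nouns and verbs: all noun tags start with N, and all verb tags (apart from the modal tag MD) start with V. You can therefore use the first character of the tag to see if it is a noun or verb.

Try this: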

>>> from nltk import word_tokenize, pos_tag
>>> sent = 'this is a foo bar sentence with many nouns talking to blah blah black sheep.'
>>> pos_tag(word_tokenize(sent))
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('sentence', 'NN'), ('with', 'IN'), ('many', 'JJ'), ('nouns', 'NNS'), ('talking', 'VBG'), ('to', 'TO'), ('blah', 'VB'), ('blah', 'NN'), ('black', 'NN'), ('sheep', 'NN'), ('.', '.')]
>>> token_pos = []
>>> for token, pos in pos_tag(word_tokenize(sent)):
...     if pos[0] in ['N', 'V']:
...         pos = 'noun' if pos[0] == 'N' else 'verb'
...     token_pos.append((token, pos))
... 
>>> token_pos
[('this', 'DT'), ('is', 'verb'), ('a', 'DT'), ('foo', 'noun'), ('bar', 'noun'), ('sentence', 'noun'), ('with', 'IN'), ('many', 'JJ'), ('nouns', 'noun'), ('talking', 'verb'), ('to', 'TO'), ('blah', 'verb'), ('blah', 'noun'), ('black', 'noun'), ('sheep', 'noun'), ('.', '.')]
>>> [i for i in token_pos if i[1] == 'noun']
[('foo', 'noun'), ('bar', 'noun'), ('sentence', 'noun'), ('nouns', 'noun'), ('blah', 'noun'), ('black', 'noun'), ('sheep', 'noun')]
>>> [i for i in token_pos if i[1] == 'verb']
[('is', 'verb'), ('talking', 'verb'), ('blah', 'verb')]
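The question also asks about adverbs and pronouns, where a single first character is too loose in PTB: 'R' would also match RP (particle), and 'P' would also match PDT and POS. A rough sketch using slightly longer prefixes is below. The PREFIXES table and both helper names are my own for illustration, the tagged tokens are copied from the pos_tag output above so it runs without NLTK, and wh-words (WRB, WP) and modals (MD) are deliberately left out.

```python
# Tagged tokens copied from the pos_tag output above (Penn Treebank tags).
tagged = [
    ('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('bar', 'NN'),
    ('sentence', 'NN'), ('with', 'IN'), ('many', 'JJ'), ('nouns', 'NNS'),
    ('talking', 'VBG'), ('to', 'TO'), ('blah', 'VB'), ('blah', 'NN'),
    ('black', 'NN'), ('sheep', 'NN'), ('.', '.'),
]

# Hypothetical prefix table for PTB tags only; it does not carry over to
# other tagsets, and it ignores WRB/WP (wh-words) and MD (modals).
PREFIXES = {"noun": "NN", "verb": "VB", "adverb": "RB", "pronoun": "PRP"}


def tokens_of(tagged_tokens, category):
    """Return the tokens whose PTB tag starts with the category's prefix."""
    prefix = PREFIXES[category]
    return [tok for tok, tag in tagged_tokens if tag.startswith(prefix)]


def contains(tagged_tokens, category):
    """Return True as soon as one token of the category is found."""
    prefix = PREFIXES[category]
    return any(tag.startswith(prefix) for _, tag in tagged_tokens)


print(tokens_of(tagged, "noun"))
print(contains(tagged, "adverb"))
```

For very large texts, contains is the cheaper of the two because any() stops at the first match instead of building a list.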

Upvotes: 1

POS tags have no base forms. Worse, there is no general consensus on a single POS tagset for English. However, some tagsets do follow a naming pattern, and they are often fairly small.

Judging by the list you give (NN,NNS,NNPS,NNS), you are probably using the Penn Treebank (PTB) tagset, which is also one of the most commonly used. All the noun tags in PTB (NN, NNS, NNP and NNPS) do begin with 'NN'.

You can find the list of POS tags in PTB here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

For comparison, here is the Brown POS tag set: http://en.wikipedia.org/wiki/Brown_Corpus#Part-of-speech_tags_used

In this case, the noun tags do not all begin with NN.
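To make the difference concrete, here is a small sketch that spells out the noun tags of each tagset as explicit sets instead of relying on a prefix. The PTB set is complete for nouns; the Brown set is only a partial list that I typed in for illustration (the full tagset has more variants, such as possessive forms), and the names are my own.

```python
# All four noun tags of the Penn Treebank tagset.
PTB_NOUN_TAGS = frozenset({"NN", "NNS", "NNP", "NNPS"})

# Partial, illustrative list of Brown noun tags (assumption: not exhaustive).
BROWN_NOUN_TAGS_SUBSET = frozenset({"NN", "NNS", "NP", "NPS", "NR"})


def is_noun(tag, noun_tags):
    """Set membership check; works for any tagset whose noun tags you list."""
    return tag in noun_tags


# Every PTB noun tag is caught by the 'NN' prefix test...
print(all(tag.startswith("NN") for tag in PTB_NOUN_TAGS))

# ...but the same test misses several Brown noun tags.
brown_missed = sorted(t for t in BROWN_NOUN_TAGS_SUBSET if not t.startswith("NN"))
print(brown_missed)
```

An explicit set costs a few more characters than a prefix test, but it makes the dependence on a particular tagset visible in the code.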

Upvotes: 1
