Reputation: 30246
I have a list of words and would like to keep only nouns.
This is not a duplicate of "Extracting all Nouns from a text file using nltk".
In the linked question a piece of text is processed, and the accepted answer proposes a tagger. I'm aware of the different options for tagging text (nltk, textblob, spacy), but I can't use them, since my data doesn't consist of sentences. I only have a list of individual words:
would
research
part
technologies
size
articles
analyzes
line
nltk has a wide selection of corpora. I found verbnet, with a comprehensive list of verbs, but so far I haven't seen anything similar for nouns. Is there something like a dictionary where I can look up whether a word is a noun, verb, adjective, etc.?
This could probably be done by some online service. Microsoft Translator, for example, returns a lot of information in its responses: https://learn.microsoft.com/en-us/azure/cognitive-services/translator/reference/v3-0-dictionary-lookup?tabs=curl But this is a paid service. I would prefer a Python package.
Regarding the ambiguity of words: ideally I would like a dictionary that can tell me all the functions a word can have. "fish", for example, is both a noun and a verb; "eat" is only a verb; "dog" is only a noun. I'm aware that this is not an exact science. A working solution would simply remove all words that can't be nouns.
Upvotes: 3
Views: 5305
Reputation: 353
Tried using wordnet?
from nltk.corpus import wordnet

words = ["would", "research", "part", "technologies", "size", "articles", "analyzes", "line"]
for w in words:
    syns = wordnet.synsets(w)
    # The first synset is the most common sense; its lexname looks like 'noun.artifact'.
    print((w, syns[0].lexname().split('.')[0] if syns else None))
You should see:
('would', None)
('research', 'noun')
('part', 'noun')
('technologies', 'noun')
('size', 'noun')
('articles', 'noun')
('analyzes', 'verb')
('line', 'noun')
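Note that the loop above only reports the most common sense of each word. Since the question asks to drop only the words that cannot be nouns at all, one option is to collect the part of speech of every synset instead. The following is a minimal sketch, assuming NLTK and its WordNet corpus are installed; the helper names `wordnet_pos` and `keep_possible_nouns` are made up for illustration, and the lookup is passed in as a parameter so that a different lexicon could be swapped in.

```python
def wordnet_pos(word):
    """Return the set of WordNet POS letters ('n', 'v', 'a', 's', 'r') seen for word."""
    # Imported lazily so the filter below also works with any other lookup.
    # Assumes the corpus was fetched once with nltk.download('wordnet').
    from nltk.corpus import wordnet
    return {syn.pos() for syn in wordnet.synsets(word)}


def keep_possible_nouns(words, pos_lookup=wordnet_pos):
    """Keep the words for which pos_lookup reports at least one noun reading."""
    return [w for w in words if "n" in pos_lookup(w)]
```

Calling `keep_possible_nouns(words)` with the list from the question keeps every word that has at least one noun synset, so an ambiguous word like "fish" survives while a word with only verb synsets is dropped.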
Upvotes: 3
Reputation: 2079
As @Triplee and @DavidBatista pointed out, it is really complicated to find out whether a word is a noun or a verb from the word alone, because in most languages the syntactic role of a word depends on context.
Words are just representations of meanings. Because of that, I'd like to add another proposition that might fit what you mean: instead of trying to find out whether a word is a noun or a verb, try to find out whether a Concept is an Object or an Action. This still has the problem of ambiguity, because a concept can carry both the Action and the Object form.
However, you can stick to Concepts that only have object properties (such as TypeOf, HasAsPart, IsPartOf, etc.) or to Concepts that have both object and action properties (action properties are things like Subevents, Effects, Requires).
A good tool for concept searching is ConceptNet. It provides a Web API to search for concepts in its network by keyword, is based on Wikipedia and many other sources, is very complete for English, and is open. It also points to synonyms in other languages, which are tagged with their common POS, so you could average the POS of the synonyms to estimate whether the word is an object (noun-like) or an action (verb-like).
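As a rough illustration of the "average the POS" idea, the sketch below tallies the POS letters that appear in ConceptNet node URIs such as /c/en/fish/n. It assumes the public endpoint at api.conceptnet.io is reachable, that the response carries an "edges" list whose "start" and "end" nodes have an "@id" URI, and that the POS letters are n, v, a and r. The function names are hypothetical, and counting both ends of every edge is only a crude stand-in for a proper average over synonyms.

```python
import json
from collections import Counter
from urllib.parse import quote
from urllib.request import urlopen

API_ROOT = "http://api.conceptnet.io"  # public endpoint; assumed to be reachable


def fetch_edges(word, lang="en", limit=50):
    """Download the edges around a concept (needs network access)."""
    url = "{}/c/{}/{}?limit={}".format(API_ROOT, lang, quote(word.lower()), limit)
    with urlopen(url) as response:
        return json.load(response)["edges"]


def pos_from_uri(uri):
    """Return the POS letter of a URI such as /c/en/fish/n, or None if absent."""
    parts = uri.strip("/").split("/")
    # Expected layout: c / language / term [/ pos [/ ...]]
    if len(parts) >= 4 and parts[0] == "c" and parts[3] in {"n", "v", "a", "r"}:
        return parts[3]
    return None


def pos_votes(edges):
    """Tally the POS letters found on both ends of every edge."""
    votes = Counter()
    for edge in edges:
        for end in ("start", "end"):
            pos = pos_from_uri(edge[end]["@id"])
            if pos:
                votes[pos] += 1
    return votes
```

With these helpers, `pos_votes(fetch_edges("fish"))` would return a counter of POS letters; a word whose votes never include "n" is a candidate for removal, although many ConceptNet nodes carry no POS letter at all, so an empty counter says nothing either way.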
Upvotes: 0
Reputation: 189908
You can run a POS tagger on individual fragments; it will have lower accuracy, but I suppose that's already a given.
Ideally, find a POS tagger which reveals every possible reading for possible syntactic disambiguation later on in the processing pipeline. This will basically just pick out all the possible readings from the lexicon (perhaps with a probability) and let you take it from there.
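If no tagger exposes its lexicon directly, a similar table of "all possible readings, with a probability" can be approximated by counting tags in any tagged corpus. The sketch below is one way to do that under stated assumptions: `build_lexicon` and `readings` are hypothetical helper names, the sample pairs are toy data, and the tag names follow the universal tagset only by way of example.

```python
from collections import Counter, defaultdict


def build_lexicon(tagged_words):
    """Map each lowercased word to a Counter of the tags it was seen with."""
    lexicon = defaultdict(Counter)
    for word, tag in tagged_words:
        lexicon[word.lower()][tag] += 1
    return lexicon


def readings(lexicon, word):
    """Return {tag: relative frequency} for word, or {} if it is unknown."""
    counts = lexicon.get(word.lower())
    if not counts:
        return {}
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


# Toy data for illustration only; a real run could pass something like
# nltk.corpus.brown.tagged_words(tagset="universal") instead.
sample = [("fish", "NOUN"), ("fish", "VERB"), ("fish", "NOUN"),
          ("eat", "VERB"), ("dog", "NOUN")]
lexicon = build_lexicon(sample)
print(readings(lexicon, "fish"))
```

A word can then be kept whenever "NOUN" appears among its readings, optionally above some frequency threshold. Words missing from the corpus come back as an empty dict and need a fallback decision.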
Upvotes: 2
Reputation: 3134
Even if you use a dictionary, you will always have to deal with ambiguity. For example, the same word can be a noun or a verb depending on the context. Take the word "research":
The government will invest in research.
The goal is to research new techniques of POS-tagging.
Most dictionaries will have more than one definition of "research".
Where do these words come from? Can you maybe POS-tag them within the context where they occur?
Upvotes: 1