Reputation: 1801
I am attempting to write a function that will return a list of NLTK definitions for the 'tokens' tokenized from a text document, subject to the constraint of each word's part of speech.
I first convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets, and then apply .word_tokenize(), .pos_tag(), and .synsets() in turn, as seen in the following code:
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd

# convert the tag to the one used by wordnet.synsets
def convert_tag(tag):
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None

# tokenize, tag, and find synsets (give the first match between each 'token' and 'wordnet_tag')
def doc_to_synsets(doc):
    token = nltk.word_tokenize(doc)
    tag = nltk.pos_tag(token)
    wordnet_tag = convert_tag(tag)
    syns = wn.synsets(token, wordnet_tag)
    return syns[0]

# test
doc = 'document is a test'
doc_to_synsets(doc)
which, if programmed correctly, should return something like
[Synset('document.n.01'), Synset('be.v.01'), Synset('test.n.01')]
However, Python throws an error message:
'list' object has no attribute 'lower'
I also noticed that in the error message, it says
lemma = lemma.lower()
Does that mean I also need to 'lemmatize' my tokens, as this previous thread suggests? Or should I apply .lower() to the text document before doing all of this?
I am rather new to wordnet and don't really know whether it's .synsets that is causing the problem or the nltk part that is at fault. It would be really appreciated if someone could enlighten me on this.
Thank you.
[Edit] Error traceback:
AttributeError Traceback (most recent call last)
<ipython-input-49-5bb011808dce> in <module>()
22 return syns
23
---> 24 doc_to_synsets('document is a test.')
25
26
<ipython-input-49-5bb011808dce> in doc_to_synsets(doc)
18 tag = nltk.pos_tag(token)
19 wordnet_tag = convert_tag(tag)
---> 20 syns = wn.synsets(token, wordnet_tag)
21
22 return syns
/opt/conda/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py in synsets(self, lemma, pos, lang, check_exceptions)
1481 of that language will be returned.
1482 """
-> 1483 lemma = lemma.lower()
1484
1485 if lang == 'eng':
AttributeError: 'list' object has no attribute 'lower'
So after using the code kindly suggested by @dugup and @udiboy1209, I get the following output:
[[Synset('document.n.01'),
Synset('document.n.02'),
Synset('document.n.03'),
Synset('text_file.n.01'),
Synset('document.v.01'),
Synset('document.v.02')],
[Synset('be.v.01'),
Synset('be.v.02'),
Synset('be.v.03'),
Synset('exist.v.01'),
Synset('be.v.05'),
Synset('equal.v.01'),
Synset('constitute.v.01'),
Synset('be.v.08'),
Synset('embody.v.02'),
Synset('be.v.10'),
Synset('be.v.11'),
Synset('be.v.12'),
Synset('cost.v.01')],
[Synset('angstrom.n.01'),
Synset('vitamin_a.n.01'),
Synset('deoxyadenosine_monophosphate.n.01'),
Synset('adenine.n.01'),
Synset('ampere.n.02'),
Synset('a.n.06'),
Synset('a.n.07')],
[Synset('trial.n.02'),
Synset('test.n.02'),
Synset('examination.n.02'),
Synset('test.n.04'),
Synset('test.n.05'),
Synset('test.n.06'),
Synset('test.v.01'),
Synset('screen.v.01'),
Synset('quiz.v.01'),
Synset('test.v.04'),
Synset('test.v.05'),
Synset('test.v.06'),
Synset('test.v.07')],
[]]
The problem now comes down to extracting the first match (or first element) of each list in the list 'syns' and making them into a new list. For the trial document 'document is a test', it should return:
[Synset('document.n.01'), Synset('be.v.01'), Synset('angstrom.n.01'), Synset('trial.n.02')]
which is a list of the first match for each token in the text document.
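As a sketch of that last step (assuming syns is the nested list shown above), a list comprehension that skips empty sublists avoids an IndexError on tokens with no synsets, such as the trailing full stop:

# keep the first synset of each token, skipping tokens with no matches
first_matches = [s[0] for s in syns if s]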
Upvotes: 1
Views: 4192
Reputation: 426
The problem is that wn.synsets expects a single token as its first argument, but word_tokenize returns a list containing all of the tokens in the document, so your token and tag variables are actually lists.
You need to loop through all of the token-tag pairs in your document and generate a synset for each individually using something like:
tokens = nltk.word_tokenize(doc)
tags = nltk.pos_tag(tokens)

doc_synsets = []
# pos_tag returns (token, POS) pairs, so iterate over those pairs directly
for token, pos in tags:
    wordnet_tag = convert_tag(pos)
    syns = wn.synsets(token, wordnet_tag)
    # only add the first matching synset to the results (skip tokens with no match)
    if syns:
        doc_synsets.append(syns[0])
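Wrapped up as a function (a small sketch that reuses the convert_tag helper from the question), it can be called on the test document directly; the exact synsets depend on your installed WordNet data, but the result should look like the list in the question:

def doc_to_synsets(doc):
    doc_synsets = []
    # tokenize and tag in one pass, then look up synsets per (token, POS) pair
    for token, pos in nltk.pos_tag(nltk.word_tokenize(doc)):
        syns = wn.synsets(token, convert_tag(pos))
        if syns:
            doc_synsets.append(syns[0])
    return doc_synsets

doc_to_synsets('document is a test')
# e.g. [Synset('document.n.01'), Synset('be.v.01'), Synset('angstrom.n.01'), Synset('trial.n.02')]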
Upvotes: 3
Reputation: 1502
lower() is a method of the str type, which returns a lower-case version of the string.
It looks like nltk.word_tokenize() returns a list of words, not a single word. But synsets() needs to be passed a single str, not a list of str.
You may want to try running synsets in a loop, like so:
for token in nltk.word_tokenize(doc):
    syn = wn.synsets(token)
EDIT: better to use a list comprehension to get a list of syns:
syns = [wn.synsets(token) for token in nltk.word_tokenize(doc)]
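To keep the part-of-speech constraint from the question as well, one possible extension of this comprehension (reusing the convert_tag helper defined in the question) pairs each token with its tag:

# tag the tokens first, then constrain each lookup by the converted wordnet tag
tagged = nltk.pos_tag(nltk.word_tokenize(doc))
syns = [wn.synsets(token, convert_tag(pos)) for token, pos in tagged]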
Upvotes: 1