Reputation: 11
I am using this code to get all the synonyms for the words in a document named "answer_tokens.txt", but it is only listing the words in the document without their synonyms. Can someone check it out?
from nltk.corpus import wordnet
from nltk import word_tokenize

with open('answer_tokens.txt') as a:  # opening the tokenised answer file
    wn_tokens = (a.read())
# printing the answer tokens word by word as opened
print('==========================================')
synonyms = []
for b in word_tokenize(wn_tokens):
    print(str(b))
    for b in wordnet.synsets(b):
        for l in b.lemmas():
            synonyms.append(l.name())
print('==========================================')
print(set(synonyms))
This is the output it is giving:
['Compare', 'dynamic', 'case', 'data', 'changing', "'", "'", 'example', 'watching', 'video']
===================================================
set()
==================================================
This is the output we need:
['Compare', 'dynamic', 'case', 'data', 'changing', "'", "'", 'example', 'watching', 'video']
===================================================
'Compare' {'equate', 'comparison', 'compare', 'comparability', 'equivalence', 'liken'}
'dynamic' {'dynamic', 'active', 'dynamical', 'moral_force'}
'case' {'display_case', 'grammatical_case', 'example', 'event', 'causa', 'shell', 'pillow_slip', 'encase', 'character', 'cause', 'font', 'instance', 'type', 'casing', 'guinea_pig', 'slip', 'suit', "typesetter's_case", 'sheath', 'vitrine', 'typeface', 'eccentric', 'lawsuit', 'showcase', 'caseful', 'fount', 'subject', 'pillowcase', "compositor's_case", 'face', 'incase', 'case'}
'data' {'data', 'information', 'datum', 'data_point'}
'changing' {'modify', 'interchange', 'convert', 'alter', 'switch', 'transfer', 'commute', 'change', 'vary', 'deepen', 'changing', 'ever-changing', 'shift', 'exchange'}
'example' {'example', 'exemplar', 'object_lesson', 'representative', 'good_example', 'exercise', 'instance', 'deterrent_example', 'lesson', 'case', 'illustration', 'model'}
'watching' {'watch', 'observation', 'view', 'watching', 'watch_out', 'check', 'look_on', 'ascertain', 'learn', 'watch_over', 'observe', 'follow', 'observance', 'take_in', 'look_out', 'find_out', 'keep_an_eye_on', 'catch', 'determine', 'see'}
'video' {'video_recording', 'video', 'television', 'picture', 'TV', 'telecasting'}
==================================================
Upvotes: 1
Views: 805
Reputation: 141
First of all, the initialisation synonyms = [] should be done for each token separately, since you want to build a different list of synonyms for each token. So I've moved that statement into the for loop iterating over the tokens.
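The effect of a single shared list can be shown with a minimal sketch that stands in for the WordNet lookups (the fake_synonyms dictionary here is hypothetical, just to keep the example self-contained):

```python
# Stand-in for wordnet.synsets()/lemmas(): a fixed synonym table.
fake_synonyms = {"compare": ["equate", "liken"],
                 "data": ["datum", "information"]}

synonyms = []          # initialised once, OUTSIDE the token loop
per_token = {}
for token in ["compare", "data"]:
    for s in fake_synonyms[token]:
        synonyms.append(s)
    per_token[token] = set(synonyms)

# "data" wrongly inherits the synonyms collected for "compare":
# per_token["data"] == {"equate", "liken", "datum", "information"}
```

Moving synonyms = [] inside the outer loop makes each token start from an empty list, so per_token["data"] would contain only its own synonyms.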
The second problem with your code is that you use the same variable name both for iterating over the tokens and for iterating over the synsets of the current token. This way you lose the reference to the token itself, making you unable to print it afterwards.
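The name clash itself can be reproduced without NLTK; after the inner loop finishes, the loop variable no longer holds the token:

```python
tokens = ["compare", "data"]
printed = []
for b in tokens:
    # inner loop rebinds the name `b`, clobbering the token
    for b in ["syn1", "syn2"]:
        pass
    printed.append(b)  # `b` is now the last inner item, not the token

# printed == ["syn2", "syn2"] -- the original tokens are lost
```

Using distinct names (e.g. token for the outer loop and syn for the inner one) avoids this.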
Lastly, the printing of the set of synonyms should be done for each token (as you yourself stated in the question), so I have moved the print statement to the end of the for loop iterating over the tokens.
Here is the code:
from nltk.corpus import wordnet as wn
from nltk import word_tokenize as wTok

with open('answer_tokens.txt') as a:  # opening the tokenised answer file
    wn_tokens = (a.read())
print(wn_tokens)
print('==========================================')
for token in wTok(wn_tokens):
    synonyms = []
    for syn in wn.synsets(token):
        for l in syn.lemmas():
            synonyms.append(l.name())
    print(token, set(synonyms))
print('==========================================')
Hope this helps!
P.S. It could be useful to give aliases to the imports you make, in order to streamline the coding process. I have given the wn alias to the wordnet module and the wTok alias to the word_tokenize function.
Upvotes: 1