Reputation: 1118
I need to convert the WordNet database files (noun.shape, noun.state, verb.cognition ecc) from their custom extension to .txt in order to more easily extract their nouns, verbs, adjectives and adverbs in their custom category. In other words, in "DATABASE FILES ONLY" you'll find the files I'm looking for, unfortunately they have a .STATE or .SHAPE extension. They are readable in the notepad but I need a list with all the items in those files without their definition in parenthesis.
Upvotes: 1
Views: 1455
Reputation: 122142
If you're using WordNet simply as a dictionary, you can try Open Multilingual WordNet
, see http://compling.hss.ntu.edu.sg/omw/
import os, codecs
from nltk.corpus import wordnet as wn
# Read Open Multi WN's .tab file
def readWNfile(wnfile, option="ss"):
reader = codecs.open(wnfile, "r", "utf8").readlines()
wn = {}
for l in reader:
if l[0] == "#": continue
if option=="ss":
k = l.split("\t")[0] #ss as key
v = l.split("\t")[2][:-1] #word
else:
v = l.split("\t")[0] #ss as value
k = l.split("\t")[2][:-1] #word as key
try:
temp = wn[k]
wn[k] = temp + ";" + v
except KeyError:
wn[k] = v
return wn
if not os.path.exists('msa/wn-data-zsm.tab'):
os.system('wget http://compling.hss.ntu.edu.sg/omw/wns/zsm.zip')
os.system('unzip zsm.zip')
msa_wn = readWNfile('msa/wn-data-zsm.tab')
eng_wn_keys = {(str(i.offset).zfill(8) + '-'+i.pos).decode('utf8'):i for i in wn.all_synsets()}
for i in set(eng_wn_keys).intersection(msa_wn.keys()):
print eng_wn_keys[i], msa_wn[i]
Meanwhile, hold on for a while because the NLTK developers are going to put the Open Multilingual Wordnet API together soon, see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py from line 1048
Upvotes: 1