Reputation: 1118

Convert WordNet files to .txt

I need to convert the WordNet database files (noun.shape, noun.state, verb.cognition, etc.) from their custom extensions to .txt so that I can more easily extract the nouns, verbs, adjectives and adverbs in each category. In other words, the "DATABASE FILES ONLY" download contains the files I'm looking for, but they have extensions such as .STATE or .SHAPE. They can be read in Notepad, but I need a list of all the items in those files without the definitions in parentheses.

Upvotes: 1

Answers (1)

alvas

Reputation: 122142

If you're using WordNet simply as a dictionary, you can try the Open Multilingual WordNet; see http://compling.hss.ntu.edu.sg/omw/

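import codecs
import os

from nltk.corpus import wordnet as wn

# Read an Open Multilingual WordNet .tab file into a dict.
# option="ss": synset id -> words; any other value: word -> synset ids.
def readWNfile(wnfile, option="ss"):
  mapping = {}  # named "mapping" so it does not shadow the wn module
  with codecs.open(wnfile, "r", "utf8") as reader:
    for l in reader:
      if l[0] == "#": continue  # skip header/comment lines
      fields = l.rstrip("\n").split("\t")
      if len(fields) < 3: continue  # skip blank or malformed lines
      if option == "ss":
        k, v = fields[0], fields[2]  # synset id as key, word as value
      else:
        k, v = fields[2], fields[0]  # word as key, synset id as value
      # Join multiple values for the same key with ";"
      if k in mapping:
        mapping[k] = mapping[k] + ";" + v
      else:
        mapping[k] = v
  return mapping

# Download and unpack the Malay (zsm) wordnet if it is not already present
if not os.path.exists('msa/wn-data-zsm.tab'):
    os.system('wget http://compling.hss.ntu.edu.sg/omw/wns/zsm.zip')
    os.system('unzip zsm.zip')

msa_wn = readWNfile('msa/wn-data-zsm.tab')
# Key each English synset by "<8-digit offset>-<pos>"; offset() and pos() are methods in NLTK 3
eng_wn_keys = {str(i.offset()).zfill(8) + '-' + i.pos(): i for i in wn.all_synsets()}

# Print every English synset that also has an entry in the Malay wordnet
for i in set(eng_wn_keys).intersection(msa_wn.keys()):
    print(eng_wn_keys[i], msa_wn[i])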
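The join above works only because both sides use the same id string: an 8-digit zero-padded offset, a hyphen, and a part-of-speech letter. The following is a minimal sketch of that key format and of the three-column line layout the reader function assumes. The helper names synset_key and parse_tab_line are hypothetical, and the sample lines are invented placeholders, not real entries from wn-data-zsm.tab.

```python
def synset_key(offset, pos):
    """Build an id such as '00001740-n' from a numeric offset and a POS letter."""
    return str(offset).zfill(8) + "-" + pos


def parse_tab_line(line):
    """Return (synset_id, word) for a data line, or None for comments and short lines."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return None
    return fields[0], fields[2]


# Invented sample lines: column 1 is the id, column 3 is the word.
# The middle column value ("lemma") is a placeholder assumption.
sample = [
    "# header line\n",
    "00001740-n\tlemma\texample_word_1\n",
    "00001740-n\tlemma\texample_word_2\n",
]

pairs = [p for p in map(parse_tab_line, sample) if p is not None]
print(pairs)
# The key built from an integer offset matches the id read from the file.
print(synset_key(1740, "n") == pairs[0][0])
```

If the ids on the two sides differ in padding or separator, the set intersection comes back empty without raising an error, so it may be worth checking a few keys by hand first.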

In the meantime, note that the NLTK developers plan to add an Open Multilingual Wordnet API soon; see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py from line 1048
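For the goal in the question itself (a plain list of the words in a category such as noun.shape, without definitions), one possible route is to skip the raw files and group NLTK synsets by their lexicographer file name. This is a rough sketch, not a tested converter: it assumes NLTK 3 with the WordNet corpus installed, where Synset.lexname() and Synset.lemma_names() are available. The function names words_by_lexname, write_category_files and load_wordnet_synsets are hypothetical.

```python
import os
from collections import defaultdict


def words_by_lexname(synsets):
    """Group lemma names by lexicographer file name, e.g. 'noun.shape'."""
    groups = defaultdict(set)
    for ss in synsets:
        groups[ss.lexname()].update(ss.lemma_names())
    # Sort so the output files are stable between runs.
    return {name: sorted(words) for name, words in groups.items()}


def write_category_files(groups, out_dir):
    """Write one .txt file per category, one word per line."""
    os.makedirs(out_dir, exist_ok=True)
    for name, words in groups.items():
        path = os.path.join(out_dir, name + ".txt")
        with open(path, "w", encoding="utf8") as f:
            f.write("\n".join(words) + "\n")


def load_wordnet_synsets():
    """Assumption: NLTK is installed and nltk.download('wordnet') has been run."""
    from nltk.corpus import wordnet as wn
    return wn.all_synsets()
```

Calling write_category_files(words_by_lexname(load_wordnet_synsets()), "wordnet_txt") should then produce files such as noun.shape.txt and verb.cognition.txt. Multiword entries keep WordNet's underscores (for example angular_shape), so replace them with spaces if needed.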

Upvotes: 1
