Reputation: 150

Getting synsets of a custom Hungarian WordNet dictionary with NLTK

I am very new to NLP and I might be doing something wrong.

I would like to work with a Hungarian text and get the synsets, hyponyms, and hypernyms of some selected words. I am working in Python.

Since Open Multilingual Wordnet does not include a Hungarian wordnet dictionary, I downloaded one from this GitHub repository: https://github.com/mmihaltz/huwn

It is an XML file, so I converted it to .tab with a converter available in the other language folders.

I then created the '\nltk_data\corpora\omw\hun' directory and placed my new wn-data-hun.tab inside it.

Unfortunately, it does not work.

After importing nltk and wordnet, the wn.langs() command lists 'hun' as an available language.

However, the command wn.lemmas('cane', lang='hun') returns an empty list. With other languages (the ones built into Open Multilingual Wordnet), it works.

Could you please help me or point me in the right direction to make it work?

Thank you in advance!

Attached Hungarian .tab file: here

Hungarian text:

A szöveg megfelelője gyakorlatilag az összes európai nyelvben "Text" (különböző írásképekkel a nemzeti helyesírás miatt), ami a latin "textum" szóból ered, amely szó eredeti jelentése: szövet, szöveg. A magyarban a nyelvújítás idején a jelentést magyar szóval jelöltük. A szöveg egy összefüggő és a környezetétől jól elhatárolt vagy elhatárolható megnyilvánulás, kijelentés írott vagy tágabb értelemben nem írott de (le)írható nyelven. A nem feltétlenül írott, de leírható szövegre példa a dalszöveg, egy film szövege vagy improvizált színházi szöveg.

The problem is that for Hungarian it does not find anything, whereas for French it does.

Upvotes: 2

Answers (1)

Life is complex

Reputation: 15619

UPDATED 12-04-2021

I would highly recommend reaching out to the repository's owner to understand the mapping IDs that were used in the huwn.xml.

Here is why:

I cross-referenced the word mappings between .tab files for a specific word.

French mapping for the word 'chien', which is dog in English

02084071-n fra:lemma chien

Italian mapping for the word 'cane', which is dog in English

02084071-n ita:lemma cane

Arabic mapping for the word 'كلْب', which is dog in English

02084071-n arb:lemma كلْب

When I look for the mapping ID 02084071-n in your Hungarian .tab file it does not exist.

These are the mapping IDs in your file for the word 'kutya', which is Hungarian for dog. These mapping IDs don't exist in the other .tab files.

09256536-n hun:lemma kutya

02001223-n hun:lemma kutya
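The cross-check above can be reproduced with a few lines of Python. This is only a minimal sketch: `index_tab` is a hypothetical helper (not part of NLTK), and it assumes the .tab files use tab-separated `id`, `lang:lemma`, `word` columns. The sample lines are the ones quoted above.

```python
from collections import defaultdict

def index_tab(lines):
    """Map lemma -> set of synset IDs for 'id<TAB>lang:lemma<TAB>word' lines."""
    index = defaultdict(set)
    for line in lines:
        # Skip blank lines and the '#' header line.
        if not line.strip() or line.startswith("#"):
            continue
        parts = line.rstrip("\n").split("\t")
        # Only lemma rows with exactly three columns are indexed.
        if len(parts) != 3 or not parts[1].endswith(":lemma"):
            continue
        synset_id, _, lemma = parts
        index[lemma].add(synset_id)
    return index

fra = index_tab(["02084071-n\tfra:lemma\tchien"])
ita = index_tab(["02084071-n\tita:lemma\tcane"])
hun = index_tab(["09256536-n\thun:lemma\tkutya",
                 "02001223-n\thun:lemma\tkutya"])

shared = fra["chien"] & ita["cane"]
print(shared)                 # {'02084071-n'}
print(shared & hun["kutya"])  # set()
```

An empty intersection for the Hungarian entry means the IDs in that file do not refer to the same synsets as the other languages' files.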

Additionally, the format of your .tab file still does not match the format of the ones used by WordNet.

ORIGINAL POST 12-03-2021

I did some research into this question. I noted that the GitHub Hungarian WordNet repository that you're trying to use has a secondary repository with code for working with it. Have you tried the scripts in that repository?

I also looked at the format of the huwn.xml file in the first repository. It isn't in the same format as the NLTK omw .tab files.

NLTK file: wn-data-pol.tab

# plWordNet pol http://plwordnet.pwr.wroc.pl/wordnet/   wordnet
00002312-a  pol:lemma   grzbietowy
00002452-n  pol:lemma   rzecz
00002684-n  pol:lemma   obiekt
00004258-n  pol:lemma   indywiduum
00004258-n  pol:lemma   istota żywa
00004258-n  pol:lemma   stworzenie
00004258-n  pol:lemma   twór
00004296-a  pol:lemma   przedagonalny
00004296-a  pol:lemma   przedśmiertny
00004296-a  pol:lemma   przedzgonny
00004296-a  pol:lemma   śmiertelny
00004475-n  pol:lemma   egzemplarz
00004475-n  pol:lemma   jednostka
00004475-n  pol:lemma   organizm
00004475-n  pol:lemma   osobnik
00005107-a  pol:lemma   długometrażowy
00005107-a  pol:lemma   pełnometrażowy
00005205-a  pol:lemma   absolutny
00005787-n  pol:lemma   bentos
00005930-n  pol:lemma   karzeł
00006024-n  pol:lemma   heterotrof
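A quick way to spot lines that deviate from this layout is a small format check. The sketch below is an assumption-laden illustration: the pattern (8-digit offset, a one-letter part of speech, a 3-letter language code, tab separators) is inferred only from the Polish sample above, and `bad_lemma_lines` is a hypothetical helper, not an NLTK function. The last sample line is a made-up malformed entry.

```python
import re

# Pattern inferred from the wn-data-pol.tab sample: offset-pos<TAB>lang:lemma<TAB>word
LEMMA_LINE = re.compile(r"^\d{8}-[a-z]\t[a-z]{3}:lemma\t\S.*$")

def bad_lemma_lines(lines):
    """Return (line_number, text) for non-comment lines that do not match the pattern."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        # Blank lines and the '#' header are not checked.
        if not text or text.startswith("#"):
            continue
        if not LEMMA_LINE.match(text):
            problems.append((number, text))
    return problems

sample = [
    "# plWordNet\tpol\thttp://plwordnet.pwr.wroc.pl/wordnet/\twordnet",
    "00002452-n\tpol:lemma\trzecz",
    "00004258-n\tpol:lemma\tistota żywa",
    "ENG20-00002056-n hun:lemma dolog",  # hypothetical malformed line (prefix, spaces)
]
print(bad_lemma_lines(sample))
```

Running the same check over a converted wn-data-hun.tab would list every line whose ID or separators differ from the sample layout.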

file: huwn.xml

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE WNXML SYSTEM "wnxml.dtd">
<!--
Hungarian WordNet release 2015-06-09
See README.md for more information.
-->
<WNXML>
<SYNSET><ID>ENG20-00001740-n</ID><ID3>ENG30-00001740-n</ID3><POS>n</POS><SYNONYM><LITERAL>entitás<SENSE>1</SENSE></LITERAL></SYNONYM><ILR>ENG20-00002056-n<TYPE>hyponym</TYPE></ILR><ILR>ENG20-00005598-n<TYPE>hyponym</TYPE></ILR><ILR>ENG20-00016236-n<TYPE>hyponym</TYPE></ILR><ILR>ENG20-00017572-n<TYPE>hyponym</TYPE></ILR><ILR>ENG20-00022625-n<TYPE>hyponym</TYPE></ILR><ILR>ENG20-04253302-n<TYPE>hyponym</TYPE></ILR><ILR>ENG20-08694995-n<TYPE>hyponym</TYPE></ILR><ILR>ENG20-08699136-n<TYPE>hyponym</TYPE></ILR><ILR>ENG20-08843058-n<TYPE>hyponym</TYPE></ILR><DEF>Amit érzékelés, tudás vagy következtetés útján önállóan létezőnek ismerünk (élő vagy élettelen).</DEF><BCS>2</BCS><USAGE>A fizikai test az entitás önmagáról alkotott ideájának anyagi vetülete.</USAGE><STAMP>almasi 2008/03/06</STAMP><DOMAIN>factotum</DOMAIN><SUMO>Physical<TYPE>=</TYPE></SUMO></SYNSET>
<SYNSET><ID>ENG20-00002056-n</ID><ID3>ENG30-00002452-n</ID3><POS>n</POS><SYNONYM><LITERAL>dolog<SENSE>1</SENSE></LITERAL></SYNONYM><ILR>ENG20-00001740-n<TYPE>hypernym</TYPE></ILR><ILR>ENG20-04179713-n<TYPE>hyponym</TYPE></ILR><ILR>ENG20-08651117-n<TYPE>hyponym</TYPE></ILR><ILR>ENG20-08780469-n<TYPE>hyponym</TYPE></ILR><ILR>ENG20-08797461-n<TYPE>hyponym</TYPE></ILR><ILR>ENG20-08817117-n<TYPE>hyponym</TYPE></ILR><ILR>ENG20-08869095-n<TYPE>hyponym</TYPE></ILR><DEF>Független, önmagában álló entitás.</DEF><BCS>3</BCS><USAGE>Fiatalabb korában még mindenféle dolog iránt érdeklődött, de manapság csak a tévét bámulja.</USAGE><STAMP>almasi 2008/03/07</STAMP><EKSZ>dolog_1_13<TYPE>=</TYPE></EKSZ></SYNSET>

Based on these major formatting differences, I don't see how huwn.xml can simply be dropped into NLTK WordNet as-is.
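One detail in the excerpt may be worth testing, though it should be confirmed with the repository's owner. Each SYNSET carries both an `ID` prefixed `ENG20` and an `ID3` prefixed `ENG30`. For 'dolog' the `ID3` value is `ENG30-00002452-n`, and `00002452-n` is also the ID of 'rzecz' in the Polish file above. That suggests, as an assumption, that `ID3` holds the ID scheme the other .tab files use. The following is a minimal sketch of a converter built on that assumption; `wnxml_to_tab_lines` is a hypothetical helper and the sample XML is a trimmed copy of the two synsets shown above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the two SYNSET entries quoted above (relations and glosses removed).
SAMPLE = """<WNXML>
<SYNSET><ID>ENG20-00001740-n</ID><ID3>ENG30-00001740-n</ID3><POS>n</POS><SYNONYM><LITERAL>entitás<SENSE>1</SENSE></LITERAL></SYNONYM></SYNSET>
<SYNSET><ID>ENG20-00002056-n</ID><ID3>ENG30-00002452-n</ID3><POS>n</POS><SYNONYM><LITERAL>dolog<SENSE>1</SENSE></LITERAL></SYNONYM></SYNSET>
</WNXML>"""

def wnxml_to_tab_lines(xml_text, lang="hun"):
    """Emit 'id<TAB>lang:lemma<TAB>word' lines, taking the ID from ID3 (assumption)."""
    root = ET.fromstring(xml_text)
    lines = []
    for synset in root.iter("SYNSET"):
        id3 = (synset.findtext("ID3") or "").strip()
        # Synsets without an ENG30 ID cannot be aligned this way and are skipped.
        if not id3.startswith("ENG30-"):
            continue
        offset_pos = id3[len("ENG30-"):]
        for literal in synset.iter("LITERAL"):
            # literal.text is the word itself; the SENSE child is ignored.
            lemma = (literal.text or "").strip()
            if lemma:
                lines.append(f"{offset_pos}\t{lang}:lemma\t{lemma}")
    return lines

tab_lines = wnxml_to_tab_lines(SAMPLE)
print("\n".join(tab_lines))
```

If the resulting IDs line up with the other languages' files (for example, 'kutya' appearing under 02084071-n), the assumption holds for that entry; if not, the mapping question above still needs an answer from the maintainer.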

DISCLAIMER:

I'm the author of the Python package WordHoard, which can be used to obtain antonyms, synonyms, hypernyms, hyponyms, homophones and definitions. WordHoard is designed to perform language translation via 3 different services. Hungarian is a supported language for translation.

Below is an example of WordHoard translating the Hungarian word 'kutya', which is Hungarian for 'dog', and searching multiple sources for synonyms of the translated word.

from wordhoard import Synonyms
from wordhoard.utilities.google_translator import Translator

word = 'kutya'

translated_word = Translator(source_language='hu', str_to_translate=word).translate_word()
synonyms = Synonyms(translated_word).find_synonyms()
reverse_translations = []
for synonym in synonyms:
    reverse_translated_word = Translator(source_language='hu', str_to_translate=synonym).reverse_translate()
    reverse_translations.append(reverse_translated_word)

output_dict = {word: sorted(reverse_translations)}
print(output_dict)
# output 
{'kutya': ['Bow Wow', 'Hot dog', 'andvas', 'az ember legjobb barátja', 'basenji', 'belga griff', 'bitang', 'bolhazsák', 'brüsszeli griff', 'bécsi', 'bécsi virsli', 'cad', 'canis', 'canis familiaris', 'corgi', 'csomag', 'csoroszlya', 'dalmáciai', 'dándog', 'edző kutya', 'farokcsóváló', 'fido', 'fék', 'gazember', 'genus canis', 'griffon', 'házi kutya', 'háziasított állat', 'háziállat', 'játék', 'játék kutya', 'kattintson', 'korcs', 'kuri vagy goorie', 'kurva', 'kutyaszerű', 'kutyavas', 'kutyus', 'kuvasz', 'kölyökkutya', 'mancsot', 'mexikói szőrtelen', 'modortalan fickó', 'mopsz', 'mopsz-kutya', 'munkakutya', 'mutyi', 'nagy pireneusok', 'sarok', 'spicc', 'tépőfog', 'tűzkutya', 'ugató', 'uszkár', 'uszkár kutya', 'vadászkutya', 'virsli', 'visszatartott', 'walesi corgi', 'weenie', 'zászló', 'ölebkutya', 'újfundlandi', 'újfundlandi kutya', 'őszinte']}

P.S. I see that I might need to add some additional code to WordHoard to remove English words like 'Bow Wow' and 'Hot dog' from the output.

Upvotes: 0
