Riccardo Bucco

Reputation: 15384

Is there any order in WordNet's synsets?

I am using WordNet to access synonyms that share a common meaning. Here is an example:

from itertools import chain
from nltk.corpus import wordnet as wn

synsets = wn.synsets("drink")
# synsets = [Synset('drink.n.01'), Synset('drink.n.02'), Synset('beverage.n.01'), ...]
synonyms = set(chain(*(x.lemma_names() for x in synsets)))
# synonyms = {'drinking', 'drinkable', 'crapulence', 'toast', 'drink', 'drunkenness', ...}

Are synsets sorted? And, in case they are, what are the criteria? Are the first synsets of the list those which have higher chances to be correlated to the given word?

I would like to limit the number of synonyms by keeping only the "most important" ones (what "important" means in this context is to be defined, but I wonder whether WordNet has its own concept of "important").

If synsets are not sorted, what could be an alternative way to find the most appropriate synonyms of a word?

Upvotes: 0

Views: 1821

Answers (2)

andrew

Reputation: 11

I am a bit late, but I was also looking for the order and found this on the WordNet website: synsets are ordered by estimated frequency of use. The official documentation says:

"-syns (n | v | a | r )
Display synonyms and immediate hypernyms of synsets containing searchstr. 
Synsets are ordered by estimated frequency of use. [...]"

source: https://wordnet.princeton.edu/documentation/wn1wn
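If that ordering also holds for the list NLTK returns (an assumption; the quoted text documents the wn command-line tool), one way to keep only the "most important" synonyms is to read lemma names from the first few synsets only. The following is a minimal sketch, not a definitive implementation: the names first_synonyms, max_senses and synonyms_for are made up for illustration, and the NLTK import sits inside the second function so the helper itself works on any objects that expose lemma_names().

```python
def first_synonyms(synsets, max_senses=2):
    """Collect lemma names from the first `max_senses` synsets, keeping order."""
    names = []
    for synset in synsets[:max_senses]:
        for name in synset.lemma_names():
            if name not in names:  # drop duplicates, keep first-seen order
                names.append(name)
    return names


def synonyms_for(word, max_senses=2):
    # Requires NLTK and the WordNet corpus (nltk.download("wordnet")).
    from nltk.corpus import wordnet as wn

    # Assumption: wn.synsets() returns the most frequently used senses first.
    return first_synonyms(wn.synsets(word), max_senses)
```

Raising max_senses trades precision for coverage: more senses bring in more synonyms, but also rarer meanings of the word.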

Upvotes: 1

mie.ppa

Reputation: 125

The documentation has a relevant section: https://www.nltk.org/howto/wordnet.html#similarity

Various similarity finding methods are provided: path_similarity, lch_similarity, wup_similarity, res_similarity, etc.

For example, from the documentation (for path_similarity):

synset1.path_similarity(synset2): Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1.

You can use the method in the following format:

# Compare every synset of "drink" with its first (0th) synset
all_synsets = wn.synsets("drink")
syn_to_compare = all_synsets[0]
corr = [(s, syn_to_compare.path_similarity(s)) for s in all_synsets]

This generates output like:

[(Synset('drink.n.01'), 1.0), (Synset('drink.n.02'), 0.06666666666666667), (Synset('beverage.n.01'), 0.08333333333333333), (Synset('drink.n.04'), 0.09090909090909091), (Synset('swallow.n.02'), 0.07692307692307693), (Synset('drink.v.01'), None), (Synset('drink.v.02'), None), (Synset('toast.v.02'), None), (Synset('drink_in.v.01'), None), (Synset('drink.v.05'), None)]

You can then sort them with the built-in sorted() function, using the similarity score as the key (a score of None is treated as 0):

sorted(corr, key=lambda x: x[1] if x[1] is not None else 0, reverse=True)
[(Synset('drink.n.01'), 1.0), (Synset('drink.n.04'), 0.09090909090909091), (Synset('beverage.n.01'), 0.08333333333333333), (Synset('swallow.n.02'), 0.07692307692307693), (Synset('drink.n.02'), 0.06666666666666667), (Synset('drink.v.01'), None), (Synset('drink.v.02'), None), (Synset('toast.v.02'), None), (Synset('drink_in.v.01'), None), (Synset('drink.v.05'), None)]
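In the output above, the None scores all belong to verb senses compared against a noun sense, so they carry no ranking information. A minimal sketch of a variant that drops them instead of mapping them to 0 could look like this; rank_by_similarity and rank_senses are hypothetical names, and the similarity function is passed in as a parameter so that any of the measures listed earlier can be used.

```python
def rank_by_similarity(reference, candidates, similarity):
    """Return (candidate, score) pairs sorted by score, skipping None scores."""
    scored = [(c, similarity(reference, c)) for c in candidates]
    comparable = [(c, s) for c, s in scored if s is not None]
    return sorted(comparable, key=lambda pair: pair[1], reverse=True)


def rank_senses(word):
    # Requires NLTK and the WordNet corpus (nltk.download("wordnet")).
    from nltk.corpus import wordnet as wn

    synsets = wn.synsets(word)
    # Compare every sense with the first one, as in the example above.
    return rank_by_similarity(
        synsets[0], synsets, lambda a, b: a.path_similarity(b)
    )
```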

If you want to deal with proper nouns, I suggest looking into gensim's most_similar() method.

"Are synsets sorted? And, in case they are, what are the criteria? Are the first synsets of the list those which have higher chances to be correlated to the given word?"

I cannot answer this question decisively; however, I don't think there is a criterion. You can use the above method to find the most similar words based on a particular synset.

Edit: As mentioned in the comments below, the author of the question was looking for an order in the list returned by wordnet's synsets() method.

From the code available on GitHub: https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1563 for the method synsets():

if lang == "eng":
    get_synset = self.synset_from_pos_and_offset
    index = self._lemma_pos_offset_map
    if pos is None:
        pos = POS_LIST
    return [
        get_synset(p, offset)
        for p in pos
        for form in self._morphy(lemma, p, check_exceptions)
        for offset in index[form].get(p, [])
    ]

where POS_LIST has the value: POS_LIST = [NOUN, VERB, ADJ, ADV]. Therefore, synsets are returned in that part-of-speech order. Furthermore, according to their code: NOUN="n", VERB="v", ADJ="a", ADV="r"

So the order depends primarily on the part of speech, following POS_LIST, then on the forms that the method _morphy() returns for the lemma and part of speech, and finally on the order of the offsets stored in the _lemma_pos_offset_map dictionary.

For example:

>>> POS_LIST = ["n", "v", "a", "r"]
>>> syn = list()
>>> lemma = "drink"
>>> for p in POS_LIST:
...     for form in wn._morphy(lemma, p, True):
...             for offset in wn._lemma_pos_offset_map[form].get(p, []):
...                     syn.append(wn.synset_from_pos_and_offset(p, offset))
... 
>>> syn
[Synset('drink.n.01'), Synset('drink.n.02'), Synset('beverage.n.01'), Synset('drink.n.04'), Synset('swallow.n.02'), Synset('drink.v.01'), Synset('drink.v.02'), Synset('toast.v.02'), Synset('drink_in.v.01'), Synset('drink.v.05')]
>>> # You can verify it against what synsets() returns
>>> wn.synsets("drink")
[Synset('drink.n.01'), Synset('drink.n.02'), Synset('beverage.n.01'), Synset('drink.n.04'), Synset('swallow.n.02'), Synset('drink.v.01'), Synset('drink.v.02'), Synset('toast.v.02'), Synset('drink_in.v.01'), Synset('drink.v.05')]
>>> 
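Because the list is grouped by part of speech before anything else, the first synset overall is only the first noun sense whenever the word has noun senses at all. A small sketch of splitting the list by part of speech, so that the leading sense of each group can be inspected separately, might look like this; group_by_pos and first_sense_per_pos are hypothetical helper names, not part of NLTK.

```python
def group_by_pos(synsets):
    """Group synsets by part-of-speech tag, keeping their original order."""
    groups = {}
    for synset in synsets:
        groups.setdefault(synset.pos(), []).append(synset)
    return groups


def first_sense_per_pos(word):
    # Requires NLTK and the WordNet corpus (nltk.download("wordnet")).
    from nltk.corpus import wordnet as wn

    grouped = group_by_pos(wn.synsets(word))
    return {pos: group[0] for pos, group in grouped.items()}
```

If only one part of speech is needed, synsets() also accepts a pos argument, for example wn.synsets("drink", pos=wn.VERB).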

Hope the updated answer is helpful!

Upvotes: 3
