Anonymous Coder

Reputation: 1

Why accuracy is 0%

https://github.com/Saranja-Navaneethakumar/WSD_Skipgram/blob/main/skipgram.py

I'm doing word sense disambiguation with a word2vec skip-gram model, using the 2017 SemEval benchmark dataset for training and testing. I tried the code above, but it fails and accuracy comes out at 0%. I need to increase accuracy. Is there any way to increase it, or are there other mistakes in my code? Can I use any other version of the benchmark dataset?

Upvotes: 0

Views: 60

Answers (1)

gojomo

Reputation: 54183

There are likely mistakes in your code or dataset. You need to examine the results of each step to make sure its partial results "make sense" according to your programmer's intent.

For example:

  • If you print the first few items in your preprocessed_dataset, is each item a list of words, as Word2Vec expects and you intended? If not, there are problems above that point.

  • If you enable Python logging to at least the INFO level, does the training of the Word2Vec model show progress matching your understanding of the algorithm? That is, does it show it discovering the right number of, and an adequate number of, individual texts & words? (A 100-dimensional model will only work well with tens of thousands of unique words, and many hundreds of thousands of total training words.)

  • When model training is done, do tests of individual words' lists of most-similar neighbor-words give reasonable results/rankings?

If those all pass, you should next consider if your "polysemy" test data and test method actually make any sense. As you may be aware, standard Word2Vec has no intrinsic idea of whether a single orthographic word (one token) has multiple alternate senses.

Instead, the word-vector for that one token tends to be pulled somewhere between the alternate senses. That may, in high-dimensional spaces, allow it to have a mix of neighbors related to each of its 2 or more senses. But there's no guarantee that its closest neighbor will be any particular synonym, or related-word, to either of its senses.

And yet, even without seeing your test data's expectations, that seems to be what you're testing: whether a particular token (target_word) has as its nearest-neighbor a particular other token (label), which would have to be another word in the known-vocabulary.

If your unshown 'labels' are in fact some other non-natural-word named-categories-of-shared-senses, there's no chance that the Word2Vec model will return those strings as the nearest-neighbor of your target words.

So: you may have made some errors in model-training, which careful inspection of individual steps might reveal. But your final test of target_word-to-label matches, as some sort of probe of polysemy, is highly suspect: it may have no chance of working, depending on your test data & theory of the problem. (Is this a published technique with any track record of success?)

(As a final note: min_count=1 is always a bad idea with Word2Vec, which only creates good word-vectors with many varied examples of a word's use, and including words with only a few examples even sabotages the quality of other words with adequately-more examples. Any time you are tempted to lower the default of min_count=5, you are likely trying to use Word2Vec on too little data for it to help, and should be trying to find more data, or adopt some other technique, instead of changing this value to a quality-sabotaging value.)

Upvotes: 0
