oren_isp

Reputation: 779

How to evaluate Word2Vec model

Hi, I have my own corpus and I train several Word2Vec models on it. What is the best way to evaluate them against each other and choose the best one? (Not manually, obviously; I am looking for various measures.)

It's worth noting that the embeddings are for items and not words, so I can't use any existing benchmarks.

Thanks!

Upvotes: 9

Views: 10722

Answers (3)

skate_23

Reputation: 487

One way of evaluating the Word2Vec model would be to apply the K-Means algorithm to the features generated by Word2Vec. Along with that, create your own manual labels/ground truth representing the instances/records. You can calculate the accuracy of the model by comparing the clustering result tags with the ground-truth labels.

E.g.: Cluster 0 - Positive - {"This is a good restaurant", "Good food here", "Not so good dinner"}; Cluster 1 - Negative - {"This is a fantastic hotel", "food was stale"}

Now, compare the tags/labels generated by the clusters with the ground truth values of the instances/sentences in the clusters and calculate the accuracy.
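A minimal sketch of that idea, assuming gensim and scikit-learn, with made-up item ids and labels. It scores the clustering with the Adjusted Rand Index rather than raw accuracy, since K-Means cluster ids don't come pre-matched to your label ids:

```python
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Hypothetical "sentences" of item tokens and a hand-made label per item.
corpus = [
    ["item_a", "item_b", "item_c"],
    ["item_a", "item_d", "item_b"],
    ["item_e", "item_f", "item_d"],
]
ground_truth = {"item_a": 0, "item_b": 0, "item_c": 0,
                "item_d": 1, "item_e": 1, "item_f": 1}

# Train one candidate model (repeat for each model you want to compare).
model = Word2Vec(corpus, vector_size=50, min_count=1, epochs=50, seed=1)

items = [i for i in ground_truth if i in model.wv]
vectors = [model.wv[i] for i in items]
labels = [ground_truth[i] for i in items]

clusters = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(vectors)

# Adjusted Rand Index compares cluster assignments with the ground truth
# without having to match cluster ids to label ids by hand.
print("ARI:", adjusted_rand_score(labels, clusters))
```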

Upvotes: 1

addi wei

Reputation: 71

One way to evaluate the word2vec model is to develop a "ground truth" set of words. The ground truth will represent words that should ideally be closest together in vector space. For example, if your corpus is related to customer service, perhaps the vectors for "dissatisfied" and "disappointed" will ideally have the smallest Euclidean distance or the largest cosine similarity.

You create this table of ground-truth pairs; maybe it has 200 paired words. These 200 pairs are the most important word pairs for your industry/topic. To assess which word2vec model is best, simply calculate the distance for each pair, do that for all 200, sum up the total distance, and the smallest total distance indicates your best model.
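A minimal sketch of that scoring loop, assuming gensim and a hypothetical pair list. It sums cosine similarity (so the largest total wins) instead of summing distances, but the ranking idea is the same:

```python
from gensim.models import Word2Vec

# Hypothetical ground-truth pairs that should sit close together in vector space.
ground_truth_pairs = [
    ("dissatisfied", "disappointed"),
    ("refund", "chargeback"),
    # ... up to ~200 pairs for your domain
]

def pair_score(model, pairs):
    """Sum of cosine similarities over the pairs present in the vocabulary."""
    return sum(model.wv.similarity(a, b)
               for a, b in pairs
               if a in model.wv and b in model.wv)

# models = {"cbow": Word2Vec(...), "skipgram": Word2Vec(...)}  # your candidates
# best = max(models, key=lambda name: pair_score(models[name], ground_truth_pairs))
# print("best model:", best)
```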

I like this way better than the "eye-ball" method, whatever that means.

Upvotes: 5

gojomo

Reputation: 54173

There's no generic way to assess token-vector quality, if you're not even using real words against which other tasks (like the popular analogy-solving) can be tried.

If you have a custom ultimate task, you have to devise your own repeatable scoring method. That will likely either be some subset of your actual final task, or well correlated with that ultimate task. Essentially, whatever ad-hoc method you may be using to "eyeball" the results for sanity should be systematized, saving your judgements from each evaluation so that they can be re-run repeatedly against iterative model improvements.
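One way to systematize such eyeball checks (a sketch only, with hypothetical item ids, assuming gensim): record the neighbour relationships you would otherwise check by hand, then re-score every candidate model against the same saved list.

```python
from gensim.models import Word2Vec

# Hypothetical saved judgements: items you decided "by eye" should rank
# among each other's nearest neighbours.
expected_neighbors = {
    "item_a": {"item_b", "item_c"},
    "item_d": {"item_e"},
}

def neighbor_recall(model, expectations, topn=10):
    """Fraction of expected neighbours found in each item's top-n list."""
    hits = total = 0
    for item, expected in expectations.items():
        if item not in model.wv:
            continue
        nearest = {w for w, _ in model.wv.most_similar(item, topn=topn)}
        hits += len(expected & nearest)
        total += len(expected)
    return hits / total if total else 0.0

# Re-run the same saved judgements against every new model:
# for name, model in models.items():
#     print(name, neighbor_recall(model, expected_neighbors))
```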

(I'd need more info about your data/items and ultimate goals to make further suggestions.)

Upvotes: 9
