Stumbler

Reputation: 2146

Apparently random vector plotting: TSNE

I have successfully created a vector model using Gensim's word2vec library. The distances between related vectors are good (that is to say, the derived similarities make sense from a human perspective).

However, attempting to map these vectors onto a graph has proven challenging. Naturally, the N dimensions of the vectors need to be reduced to make them plottable, and to that end I've used TSNE.

import gensim, logging, os
import codecs
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# `model` is a Word2Vec model trained earlier (loading omitted here)
wvs = model.syn1neg          # one weight row per vocabulary word
vocabulary = model.wv.vocab  # dict mapping each word to its vocabulary entry

tsne = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
Y = tsne.fit_transform(wvs)  # reduce every word vector to 2 dimensions

plt.scatter(Y[:, 0], Y[:, 1])
for label, x, y in zip(vocabulary, Y[:, 0], Y[:, 1]):
    plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
plt.show()

However, the resulting points appear essentially random: there is just a single huge cluster with a couple of outliers.

[image: t-SNE scatter plot of the vocabulary, showing one large cluster]

A case in point: note the nearest neighbours to "hallucinating" at the edge of the cluster

[image: close-up of the points nearest "hallucinating" in the plot]

But the actual nearest neighbours returned by model.most_similar() are:

[('agitated', 0.7707732319831848),
 ('restless', 0.740711510181427),
 ('disorientated', 0.7242116332054138),
 ('confused', 0.7215688228607178),
 ('aggressive', 0.7169249057769775),
 ('drowsy', 0.6654224395751953),
 ('tearful', 0.6573441624641418),
 ('aggitated', 0.6566967964172363),
 ('sleepy', 0.6562871932983398),
 ('shaking', 0.6419488191604614)]
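(For reference, a list like this comes from a call of roughly the following form; the query word and topn value here are assumptions matching the output above.)

# hypothetical call; returns the (word, cosine similarity) pairs shown above
model.most_similar('hallucinating', topn=10)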

How can I approach this to make the output more sensible?

Upvotes: 0

Views: 507

Answers (1)

Grr

Reputation: 16079

Absolutely read the article linked by @MattiLyra. Beyond that, based on what I know (without seeing the actual data), you may want to increase the n_iter parameter a good bit; 1000 iterations usually won't take you to a static state. You may also want to play around with the method parameter. The documentation for sklearn.manifold.TSNE states:

"By default the gradient calculation algorithm uses Barnes-Hut approximation running in O(NlogN) time. method=’exact’ will run on the slower, but exact, algorithm in O(N^2) time. The exact algorithm should be used when nearest-neighbor errors need to be better than 3%. However, the exact method cannot scale to millions of examples."

If you change the method to 'exact', you can also use the n_iter_without_progress parameter, essentially letting the model run until it finds a static point.
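A minimal sketch of those two suggestions together (assuming a scikit-learn version where TSNE still accepts n_iter; the specific values are illustrative, not tuned):

from sklearn.manifold import TSNE

# exact O(N^2) gradients instead of the Barnes-Hut approximation,
# with far more iterations than the default 1000; with method='exact',
# n_iter_without_progress lets the optimisation stop once it stalls
tsne = TSNE(
    n_components=2,
    method='exact',
    n_iter=5000,                  # illustrative; raise as needed
    n_iter_without_progress=300,  # stop after 300 stagnant iterations
    random_state=0,
)
Y = tsne.fit_transform(wvs)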

Upvotes: 2
