Stumbler

Reputation: 2146

Apparently random vector plotting: TSNE

I have successfully created a vector model using Gensim's word2vec library. The distances between related vectors are good (that is to say, the derived similarities make sense from a human perspective).

However, attempting to map these vectors onto a graph has proven challenging. Naturally, the N dimensions of the vectors need to be reduced to make them plottable, and to that end I've used TSNE.

import gensim, logging, os
import codecs
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# `model` is a Word2Vec model trained earlier (loading omitted here)
wvs = model.syn1neg          # one weight row per vocabulary word
vocabulary = model.wv.vocab  # dict mapping each word to its vocabulary entry

tsne = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
Y = tsne.fit_transform(wvs)  # reduce every word vector to 2 dimensions

plt.scatter(Y[:, 0], Y[:, 1])
for label, x, y in zip(vocabulary, Y[:, 0], Y[:, 1]):
    plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
plt.show()

However, the resulting points appear essentially random: there is just a single huge cluster with a couple of outliers.

[image: t-SNE scatter plot of the vocabulary, showing one large cluster]

A case in point: note the nearest neighbours to "hallucinating" at the edge of the cluster

[image: close-up of the points nearest "hallucinating" in the plot]

But the actual nearest neighbours returned by model.most_similar() are:

[('agitated', 0.7707732319831848),
 ('restless', 0.740711510181427),
 ('disorientated', 0.7242116332054138),
 ('confused', 0.7215688228607178),
 ('aggressive', 0.7169249057769775),
 ('drowsy', 0.6654224395751953),
 ('tearful', 0.6573441624641418),
 ('aggitated', 0.6566967964172363),
 ('sleepy', 0.6562871932983398),
 ('shaking', 0.6419488191604614)]
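(For reference, a list like this comes from a call of roughly the following form; the query word and topn value here are assumptions matching the output above.)

# hypothetical call; returns the (word, cosine similarity) pairs shown above
model.most_similar('hallucinating', topn=10)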

How can I approach this to make the output more sensible?

Upvotes: 0

Views: 507

Answers (1)

Grr

Reputation: 16079

Absolutely read the article linked by @MattiLyra. Beyond that, based on what I know (without seeing the actual data), you may want to increase the n_iter parameter a good bit; 1000 iterations usually won't take you to a static state. You may also want to play around with the method parameter. The documentation for sklearn.manifold.TSNE states:

"By default the gradient calculation algorithm uses Barnes-Hut approximation running in O(NlogN) time. method=’exact’ will run on the slower, but exact, algorithm in O(N^2) time. The exact algorithm should be used when nearest-neighbor errors need to be better than 3%. However, the exact method cannot scale to millions of examples."

If you change the method to 'exact', you can also use the n_iter_without_progress parameter, essentially letting the model run until it finds a static point.
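A minimal sketch of those two suggestions together (assuming a scikit-learn version where TSNE still accepts n_iter; the specific values are illustrative, not tuned):

from sklearn.manifold import TSNE

# exact O(N^2) gradients instead of the Barnes-Hut approximation,
# with far more iterations than the default 1000; with method='exact',
# n_iter_without_progress lets the optimisation stop once it stalls
tsne = TSNE(
    n_components=2,
    method='exact',
    n_iter=5000,                  # illustrative; raise as needed
    n_iter_without_progress=300,  # stop after 300 stagnant iterations
    random_state=0,
)
Y = tsne.fit_transform(wvs)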

Upvotes: 2
