Rameau

Reputation: 43

Gensim Doc2Vec visualization issue when using t-SNE and/or PCA

I am trying to familiarize myself with Doc2Vec by using a public dataset of movie reviews. I have cleaned the data and trained the model. As you can see below, there are 6 tags/genres; each is a document with its own vector representation.

doc_tags = list(doc2vec_model.docvecs.doctags.keys())  # all known tags (pre-4.0 gensim API)
print(doc_tags)
X = doc2vec_model[doc_tags]  # the trained vector for each tag
print(X)
['animation', 'fantasy', 'comedy', 'action', 'romance', 'sci-fi']
[[ -0.6630892    0.20754902   0.2949621    0.622197     0.15592825]
 [ -1.0809666    0.64607996   0.3626246    0.9261689    0.31883526]
 [ -2.3482993    2.410015     0.86162883   3.0468733   -0.3903969 ]
 [ -1.7452248    0.25237766   0.6007084    2.2371168    0.9400951 ]
 [ -1.9570891    1.3037877   -0.24805197   1.6109428   -0.3572465 ]
 [-15.548988    -4.129228     3.608777    -0.10240117   3.2107658 ]]

print(doc2vec_model.docvecs.most_similar('romance'))
[('comedy', 0.6839742660522461), ('animation', 0.6497607827186584), ('fantasy', 0.5627620220184326), ('sci-fi', 0.14199887216091156), ('action', 0.046558648347854614)]

"Romance" and "comedy" are fairly similar, while "action" and "sci-fi" are fairly dissimilar genres compared to "romance". So far so good. However, in order to visualize the results, I need to reduce the vector dimensionality. Therefore, I try first t-SNE and then PCA. This is the code and the results:

# t-SNE
from sklearn.manifold import TSNE
import pandas as pd

# Note: with only 6 samples, t-SNE's default perplexity (30) exceeds the
# sample count; newer scikit-learn versions reject this combination.
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X)
df = pd.DataFrame(X_tsne, index=doc_tags, columns=['x', 'y'])
print(df)
                    x           y
animation -162.499695   74.153679
fantasy    -10.496888   93.687149
comedy     -38.886723  -56.914558
action     -76.036247  232.218231
romance    101.005371  198.827988
sci-fi     123.960182   20.141081

# PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
df_1 = pd.DataFrame(X_pca, index=doc_tags, columns=['x', 'y'])
print(df_1)
                   x         y
animation  -3.060287 -1.474442
fantasy    -2.815175 -0.888522
comedy     -2.520171  2.244404
action     -2.063809 -0.191137
romance    -2.578774  0.370727
sci-fi     13.038214 -0.061030

There is something wrong. This is even more visible when I visualize the results:

t-SNE: [scatter plot of the t-SNE coordinates above]

PCA: [scatter plot of the PCA coordinates above]

This is clearly not what the model produced. I am sure I am missing something basic. Any suggestions would be greatly appreciated.

Upvotes: 4

Views: 2705

Answers (2)

gojomo

Reputation: 54183

First, you're always going to lose some qualities of the full-dimensionality model when doing a 2D projection, as required for such visualizations. You just hope – & try to choose appropriate methods/parameters – that the important aspects are preserved. So there isn't necessarily anything 'wrong' when a particular visualization disappoints.

And especially with high-dimensional 'dense embeddings' like with word2vec/doc2vec, there's way more info in the full embedding than can be shown in the 2D projection. You may see some sensible micro-relationships in such a plot – close neighbors in a few places matching expectations – but the overall 'map' won't be nearly as interpretable as, well, a real map of a truly 2D surface.

But also: it looks like you're training a Doc2Vec model with only 6 document-tags. Because of the way Doc2Vec works, with only 6 unique tags you're essentially training on only 6 virtual documents, just chopped up into different fragments. It's as if you took all the 'comedy' reviews and concatenated them into one big doc, and the same with all the 'romance' reviews, etc.

For many uses of Doc2Vec, and particularly in the published papers that introduced the underlying 'Paragraph Vector' algorithm, it is more typical to use each document's unique ID as its 'tag', especially since many downstream uses then need a doc-vector per document rather than per known category. This may better preserve/model information in the original data, whereas collapsing everything to just 6 mega-documents and 6 summary tag-vectors imposes simpler implied category-shapes.
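
As a hedged sketch of that per-document tagging (the names reviews and genres are hypothetical stand-ins for your tokenized reviews and their genre labels):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# One unique ID tag per review; to also train a per-genre vector, you could
# additionally pass the genre as a second tag, e.g. tags=[str(i), genres[i]]
docs = [TaggedDocument(words=tokens, tags=[str(i)])
        for i, tokens in enumerate(reviews)]
model = Doc2Vec(docs, vector_size=50, epochs=20)  # illustrative parameters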

Note that if you use unique IDs as tags, you won't automatically wind up with one summary tag-vector per category that you can read from the model. But you could synthesize such a vector, perhaps by simply averaging the vectors of all the docs in a given category to get that category's centroid.
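
For example, a minimal sketch of that averaging, assuming the per-review model above and the hypothetical genres list:

import numpy as np

# Average the vectors of all reviews whose known label is 'comedy'
comedy_ids = [str(i) for i, g in enumerate(genres) if g == 'comedy']
comedy_centroid = np.mean([model.docvecs[tag] for tag in comedy_ids], axis=0)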

It's still sometimes valuable to use known labels as document tags, either instead of unique IDs (as you've done here) or in addition to unique IDs (each training document may carry more than one tag, as in the two-tag variant noted in the sketch above).

But you should know that using known labels, and only known labels, as tags can be limiting. (For example, if you instead trained a separate vector per document, you could plot the docs via your visualization, color the dots with their known labels, and see which categories have large overlaps, and highlight datapoints that seem to challenge their categories, e.g. by having nearest neighbors in a different category.)
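
A rough sketch of that kind of plot, again assuming the per-review model and genres list from above, plus matplotlib:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project every per-review vector to 2D, then color the dots by known genre
doc_vecs = [model.docvecs[str(i)] for i in range(len(genres))]
xy = PCA(n_components=2).fit_transform(doc_vecs)
for genre in set(genres):
    pts = [p for p, g in zip(xy, genres) if g == genre]
    plt.scatter([p[0] for p in pts], [p[1] for p in pts], label=genre, s=10)
plt.legend()
plt.show()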

Upvotes: 3

Tinu

Reputation: 2513

  • t-SNE embedding: it is a common mistake to think that distances between points (or clusters) in the embedded space are proportional to the distances in the original space. This is a major drawback of t-SNE; for more information, see here. Therefore you shouldn't draw any conclusions about original-space distances from the visualization.

  • PCA embedding: PCA corresponds to a rotation of the coordinate system into a new orthogonal coordinate system that optimally describes the variance of the data. When all principal components are kept, the (Euclidean) distances are preserved; however, when the dimension is reduced (e.g. to 2D), the points are projected onto the axes with the most variance, and the distances may no longer correspond to the original ones. So again, it is difficult to draw conclusions about the distances between points in the original space from the embedding. You can at least check how much of the variance the 2D projection keeps, as sketched below.
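
A minimal sketch of that check, using the X matrix from the question (assumes scikit-learn):

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Fraction of the original variance captured by each kept component;
# the closer the sum is to 1.0, the more faithful the 2D distances are.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())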

Upvotes: 2
