Reputation: 43
I am trying to familiarize myself with Doc2Vec results by using a public dataset of movie reviews. I have cleaned the data and trained the model. As you can see below, there are 6 tags/genres, each treated as a document with its own vector representation.
# doc2vec_model is a trained gensim Doc2Vec model (training code omitted)
doc_tags = list(doc2vec_model.docvecs.doctags.keys())
print(doc_tags)
X = doc2vec_model[doc_tags]
print(X)
['animation', 'fantasy', 'comedy', 'action', 'romance', 'sci-fi']
[[ -0.6630892 0.20754902 0.2949621 0.622197 0.15592825]
[ -1.0809666 0.64607996 0.3626246 0.9261689 0.31883526]
[ -2.3482993 2.410015 0.86162883 3.0468733 -0.3903969 ]
[ -1.7452248 0.25237766 0.6007084 2.2371168 0.9400951 ]
[ -1.9570891 1.3037877 -0.24805197 1.6109428 -0.3572465 ]
[-15.548988 -4.129228 3.608777 -0.10240117 3.2107658 ]]
print(doc2vec_model.docvecs.most_similar('romance'))
[('comedy', 0.6839742660522461), ('animation', 0.6497607827186584), ('fantasy', 0.5627620220184326), ('sci-fi', 0.14199887216091156), ('action', 0.046558648347854614)]
"Romance" and "comedy" are fairly similar, while "action" and "sci-fi" are fairly dissimilar genres compared to "romance". So far so good. However, in order to visualize the results, I need to reduce the vector dimensionality. Therefore, I try first t-SNE and then PCA. This is the code and the results:
# TSNE
from sklearn.manifold import TSNE
import pandas as pd

tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X)
df = pd.DataFrame(X_tsne, index=doc_tags, columns=['x', 'y'])
print(df)
x y
animation -162.499695 74.153679
fantasy -10.496888 93.687149
comedy -38.886723 -56.914558
action -76.036247 232.218231
romance 101.005371 198.827988
sci-fi 123.960182 20.141081
# PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
df_1 = pd.DataFrame(X_pca, index=doc_tags, columns=['x', 'y'])
print(df_1)
x y
animation -3.060287 -1.474442
fantasy -2.815175 -0.888522
comedy -2.520171 2.244404
action -2.063809 -0.191137
romance -2.578774 0.370727
sci-fi 13.038214 -0.061030
There is something wrong. This is even more visible when I visualize the results:
TSNE: [scatter plot of the 2D t-SNE coordinates above]
PCA: [scatter plot of the 2D PCA coordinates above]
This is clearly not what the model has produced. I am sure I am missing something basic. Any suggestions would be greatly appreciated.
Upvotes: 4
Views: 2705
Reputation: 54183
First, you're always going to lose some qualities of the full-dimensionality model when doing a 2D projection, as required for such visualizations. You just hope – & try to choose appropriate methods/parameters – that the important aspects are preserved. So there isn't necessarily anything 'wrong' when a particular visualization disappoints.
And especially with high-dimensional 'dense embeddings' like with word2vec/doc2vec, there's way more info in the full embedding than can be shown in the 2D projection. You may see some sensible micro-relationships in such a plot – close neighbors in a few places matching expectations – but the overall 'map' won't be nearly as interpretable as, well, a real map of a truly 2D surface.
But also: it looks like you're creating a 30-dimensional Doc2Vec model with only 6 document-tags. Because of the way Doc2Vec works, if there are only 6 unique tags, it's essentially the case that you're training on only 6 virtual documents, just chopped up into different fragments. It's as if you took all the 'comedy' reviews and concatenated them into one big doc, and the same with all the 'romance' reviews, etc.
For many uses of Doc2Vec, and particularly in the published papers that introduced the underlying 'Paragraph Vector' algorithm, it is more typical to use each document's unique ID as its 'tag', especially since many downstream uses then need a doc-vector per document rather than per known category. This may better preserve/model the information in the original data, whereas collapsing everything into just 6 mega-documents, and 6 summary tag-vectors, imposes simpler implied category shapes.
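As a minimal, hypothetical sketch of that per-document-ID setup (the reviews list, the split() tokenization, and parameters like vector_size are assumptions, not taken from the question), it might look roughly like this with gensim:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# hypothetical corpus: a list of (review_text, genre) pairs prepared elsewhere
tagged_docs = [
    TaggedDocument(words=review_text.split(), tags=[f'doc_{i}'])
    for i, (review_text, genre) in enumerate(reviews)
]

model = Doc2Vec(tagged_docs, vector_size=50, min_count=2, epochs=20)
# one vector per review is now available, e.g. model.docvecs['doc_0']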
Note that if using unique IDs as tags, you won't automatically wind up with one summary tag-vector per category that you can read from the model. But, you could synthesize such a vector, perhaps by simply averaging the vectors of all the docs in a certain category to get that category's centroid.
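For instance, a rough sketch of that averaging idea, reusing the assumed per-review model and reviews list from the sketch above and the gensim docvecs API the question already uses:

import numpy as np

def genre_centroid(model, reviews, genre):
    # average the per-review vectors that belong to one genre
    vecs = [model.docvecs[f'doc_{i}']
            for i, (_, g) in enumerate(reviews) if g == genre]
    return np.mean(vecs, axis=0)

romance_centroid = genre_centroid(model, reviews, 'romance')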
It's still sometimes valuable to use known labels as document tags, either instead of unique IDs (as you've done here), or in addition to unique IDs (using the option of more than one tag per training document).
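With gensim that just means supplying more than one tag per TaggedDocument; a short variant of the earlier (assumed) sketch:

tagged_docs = [
    TaggedDocument(words=review_text.split(), tags=[f'doc_{i}', genre])
    for i, (review_text, genre) in enumerate(reviews)
]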
But you should know that using known labels, and only known labels, as tags can be limiting. (For example, if you instead trained a separate vector per document, you could then plot the docs via your visualization, color the dots by known label, see which categories tend to have large overlaps, and highlight particular datapoints that seem to challenge the categories or that have nearest neighbors in a different category.)
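A rough matplotlib sketch of that colored-by-label plot, under the same assumptions about the per-review model and reviews list as the sketches above:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

doc_ids = [f'doc_{i}' for i in range(len(reviews))]
vectors = [model.docvecs[doc_id] for doc_id in doc_ids]
coords = PCA(n_components=2).fit_transform(vectors)

# one scatter call per genre so each known label gets its own color
for genre in sorted(set(g for _, g in reviews)):
    idx = [i for i, (_, g) in enumerate(reviews) if g == genre]
    plt.scatter(coords[idx, 0], coords[idx, 1], label=genre, s=10)
plt.legend()
plt.show()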
Upvotes: 3
Reputation: 2513
t-SNE embedding: it is a common mistake to think that the distances between points (or clusters) in the embedded space are proportional to the distances in the original space. This is a major drawback of t-SNE; for more information see here. Therefore you shouldn't draw conclusions about distances from the visualization.
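One way to see this instability with the question's tiny X (only 6 points) is to rerun t-SNE with different perplexities and seeds and compare the layouts. Note that with so few samples the perplexity has to be small; recent scikit-learn versions reject perplexity >= n_samples. This is only an illustrative sketch:

from sklearn.manifold import TSNE

# the relative positions/distances of the 6 genres typically change
# noticeably from run to run, which is why they shouldn't be over-read
for perplexity in (2, 3, 5):
    emb = TSNE(n_components=2, perplexity=perplexity,
               random_state=perplexity).fit_transform(X)
    print(perplexity)
    print(emb)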
PCA embedding: PCA corresponds to a rotation of the coordinate system into a new orthogonal coordinate system which optimally describes the variance of the data. When keeping all principal components, the (Euclidean) distances are preserved; however, when reducing the dimension (e.g. to 2D), the points are projected onto the axes with the most variance, and the distances might no longer correspond to the original distances. Again, it is difficult to draw conclusions about the distances between points in the original space from the embedding.
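A small sketch of that distance argument using the question's X: pairwise Euclidean distances are unchanged when all components are kept, but generally change after projecting to 2 components:

import numpy as np
from scipy.spatial.distance import pdist
from sklearn.decomposition import PCA

d_orig = pdist(X)                                               # distances in the original space
d_full = pdist(PCA(n_components=X.shape[1]).fit_transform(X))   # all components: rotation (+ centering)
d_2d = pdist(PCA(n_components=2).fit_transform(X))              # 2D projection

print(np.allclose(d_orig, d_full))  # True (up to numerical precision)
print(np.allclose(d_orig, d_2d))    # generally False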
Upvotes: 2