curlypie99

Reputation: 27

PCA on word2vec embeddings using pre-existing model

I have a word2vec model trained on Tweets. I also have a list of words, and I need to get the embeddings for those words, compute the first two principal components, and plot each word in a two-dimensional space.

I'm trying to follow tutorials such as this one: https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

However, in all such tutorials, they create a model from a toy sentence and then compute PCA on every word in that model. I don't want to do that; I only want to compute and plot specific words. How can I use the model I already have, which contains thousands of words, and compute the first two principal components for a fixed list of around 20 words?

In the link above, for example, "model" contains only the words from the sentence they wrote, and they then do "X = model[model.wv.vocab]" followed by "pca.fit_transform(X)". If I copied this code, I would run PCA on the entire huge model, which I don't want to do. I just want to extract the embeddings of a few words from that model and then compute PCA on those words, roughly as sketched below. Hopefully this makes sense; thanks in advance, and please let me know if I need to clarify anything.
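
For reference, the whole-vocabulary pattern from that tutorial looks roughly like this (gensim's pre-4.0 API; this is exactly the part I want to avoid, since my model has thousands of words):

from sklearn.decomposition import PCA

# tutorial pattern: run PCA over every word in the model's vocabulary
X = model[model.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)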

Upvotes: 1

Views: 2094

Answers (1)

Captain Trojan

Reputation: 2921

Create a collection with the same structure (a dictionary) as model.wv.vocab, fill it with your target words only, and compute PCA on that.

You can do this using the following code:

from sklearn.decomposition import PCA

# keep only the vocabulary entries for the target words
my_vocab = {}
for w in my_words:
    my_vocab[w] = model.wv.vocab[w]

# look up those embeddings and project them onto the first two components
X = model[my_vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
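
If you also want the 2-D plot the question asks for, a minimal sketch with matplotlib could look like the following. It assumes the rows of "result" come back in the same order as "my_words", which holds on Python 3.7+, where dictionaries preserve insertion order:

import matplotlib.pyplot as plt

# scatter the two principal components and label each point with its word
plt.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(my_words):
    plt.annotate(word, xy=(result[i, 0], result[i, 1]))
plt.show()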

Upvotes: 2
