user1927468

Reputation: 1133

Embedding in pytorch

Does Embedding make similar words closer to each other? And do I just need to give it all the sentences? Or is it just a lookup table, so that I need to code the model myself?

Upvotes: 81

Views: 130277

Answers (5)

Prajot Kuvalekar

Reputation: 6618

One more trick: you can read off the embedding of an individual word directly from the layer's weight matrix.

import torch
import torch.nn as nn

CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()



vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)


    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        return embeds

model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
# To get the embedding of a particular word, e.g. "beauty"
print(model.embeddings.weight[word_to_ix["beauty"]])
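As a quick sanity check (just a sketch, reusing model and word_to_ix from above), looking the word up through the layer's forward pass gives the same values, only with an extra batch dimension:

# same values as model.embeddings.weight[word_to_ix["beauty"]],
# just with shape (1, EMBEDDING_DIM) instead of (EMBEDDING_DIM,)
idx = torch.LongTensor([word_to_ix["beauty"]])
print(model.embeddings(idx))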

Upvotes: 0

Garima Jain

Reputation: 855

torch.nn.Embedding just creates a lookup table that returns the word embedding for a given word index.

from collections import Counter
import torch
import torch.nn as nn

# Let's say you have 2 sentences (lowercased, punctuation removed):
sentences = "i am new to pytorch i am having fun"

words = sentences.split(' ')
    
vocab = Counter(words) # create a dictionary
vocab = sorted(vocab, key=vocab.get, reverse=True)
vocab_size = len(vocab)

# map words to unique indices
word2idx = {word: ind for ind, word in enumerate(vocab)} 

# word2idx = {'i': 0, 'am': 1, 'new': 2, 'to': 3, 'pytorch': 4, 'having': 5, 'fun': 6}

encoded_sentences = [word2idx[word] for word in words]

# encoded_sentences = [0, 1, 2, 3, 4, 0, 1, 5, 6]

# let's say you want embedding dimension to be 3
emb_dim = 3 

Now, the embedding layer can be initialized as:

emb_layer = nn.Embedding(vocab_size, emb_dim)
word_vectors = emb_layer(torch.LongTensor(encoded_sentences))

This initializes the embeddings from a standard normal distribution (that is, zero mean and unit variance). Thus, these word vectors don't have any sense of 'relatedness' yet.
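You can check the random initialization yourself; this is just a sketch, reusing the nn import from above and using a larger layer only so the statistics are stable:

check = nn.Embedding(10000, 300)
print(check.weight.mean().item())  # close to 0
print(check.weight.std().item())   # close to 1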

word_vectors is a torch tensor of size (9, 3), since there are 9 words in our data.

emb_layer has one trainable parameter called weight, which is, by default, set to be trained. You can check it by:

emb_layer.weight.requires_grad

which returns True. If you don't want to train your embeddings during model training (say, when you are using pre-trained embeddings), you can set it to False by:

emb_layer.weight.requires_grad = False

If your vocabulary size is 10,000 and you wish to initialize the embeddings with pre-trained embeddings (of dim 300), say, Word2Vec, do it as:

emb_layer = nn.Embedding(10000, 300)
emb_layer.load_state_dict({'weight': torch.from_numpy(emb_mat)})

Here, emb_mat is a NumPy matrix of size (10000, 300) containing the 300-dimensional Word2Vec vector for each of the 10,000 words in your vocabulary.

Now, the embedding layer is loaded with Word2Vec word representations.
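A more compact alternative (just a sketch, under the same assumption that emb_mat is your (10000, 300) NumPy matrix) is the nn.Embedding.from_pretrained factory, which can also freeze the weights in the same step:

# .float() guards against emb_mat being float64;
# freeze=True (the default) would set requires_grad to False
emb_layer = nn.Embedding.from_pretrained(torch.from_numpy(emb_mat).float(), freeze=False)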

Upvotes: 28

AveryLiu

Reputation: 839

You could treat nn.Embedding as a lookup table where the key is the word index and the value is the corresponding word vector. However, before using it you need to specify the size of the lookup table, and you can initialize the word vectors yourself (for example with pre-trained vectors). The following code example demonstrates this.

import torch
import torch.nn as nn

# vocab_size is the number of words in your train, val and test set
# vector_size is the dimension of the word vectors you are using
embed = nn.Embedding(vocab_size, vector_size)

# initialize the word vectors; pretrained_weights is a
# numpy array of size (vocab_size, vector_size) and
# pretrained_weights[i] retrieves the word vector of the
# i-th word in the vocabulary
embed.weight.data.copy_(torch.from_numpy(pretrained_weights))

# Then turn the word indices into actual word vectors
vocab = {"some": 0, "words": 1}
word_indexes = [vocab[w] for w in ["some", "words"]]
word_vectors = embed(torch.LongTensor(word_indexes))
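As a quick check (nothing more than a sketch), the result should contain one pre-trained row per requested word:

print(word_vectors.shape)  # torch.Size([2, vector_size])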

Upvotes: 48

prosti

Reputation: 46291

Agh! I think this part is still missing: showing that when you create the embedding layer you automatically get weights, which you may later replace with nn.Embedding.from_pretrained(weight).

import torch
import torch.nn as nn

embedding = nn.Embedding(10, 4)
print(type(embedding))
print(embedding)

t1 = embedding(torch.LongTensor([0,1,2,3,4,5,6,7,8,9])) # index 10 would be out of range
print(t1.shape)
print(t1)


t2 = embedding(torch.LongTensor([1,2,3]))
print(t2.shape)
print(t2)

#predefined weights
weight = torch.FloatTensor([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
print(weight.shape)
embedding = nn.Embedding.from_pretrained(weight)
# get embeddings for ind 0 and 1
embedding(torch.LongTensor([0, 1]))

Output:

<class 'torch.nn.modules.sparse.Embedding'>
Embedding(10, 4)
torch.Size([10, 4])
tensor([[-0.7007,  0.0169, -0.9943, -0.6584],
        [-0.7390, -0.6449,  0.1481, -1.4454],
        [-0.1407, -0.1081,  0.6704, -0.9218],
        [-0.2738, -0.2832,  0.7743,  0.5836],
        [ 0.4950, -1.4879,  0.4768,  0.4148],
        [ 0.0826, -0.7024,  1.2711,  0.7964],
        [-2.0595,  2.1670, -0.1599,  2.1746],
        [-2.5193,  0.6946, -0.0624, -0.1500],
        [ 0.5307, -0.7593, -1.7844,  0.1132],
        [-0.0371, -0.5854, -1.0221,  2.3451]], grad_fn=<EmbeddingBackward>)
torch.Size([3, 4])
tensor([[-0.7390, -0.6449,  0.1481, -1.4454],
        [-0.1407, -0.1081,  0.6704, -0.9218],
        [-0.2738, -0.2832,  0.7743,  0.5836]], grad_fn=<EmbeddingBackward>)
torch.Size([2, 3])

tensor([[0.1000, 0.2000, 0.3000],
        [0.4000, 0.5000, 0.6000]])

And the last part is that the embedding layer's weights can be learned with gradient descent.
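A minimal sketch of that last point (the loss below is a made-up placeholder, only there to produce gradients):

import torch
import torch.nn as nn

embedding = nn.Embedding(10, 4)
optimizer = torch.optim.SGD(embedding.parameters(), lr=0.1)

out = embedding(torch.LongTensor([1, 2, 3]))
loss = out.pow(2).sum()  # placeholder loss, not a real training objective
loss.backward()
optimizer.step()  # the rows for indices 1, 2 and 3 have now been updated
print(embedding.weight.grad.shape)  # torch.Size([10, 4])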

Upvotes: 8

Escachator

Reputation: 1881

nn.Embedding holds a Tensor of dimension (vocab_size, vector_size), i.e. the size of the vocabulary times the dimension of each embedding vector, and provides a method that does the lookup.

When you create an embedding layer, the Tensor is initialised randomly. It is only when you train it that this similarity between similar words should appear, unless you have overwritten the values of the embedding with previously trained vectors, like GloVe or Word2Vec, but that's another story.

So, once you have the embedding layer defined, and the vocabulary defined and encoded (i.e. a unique number assigned to each word in the vocabulary), you can use the instance of the nn.Embedding class to get the corresponding embeddings.

For example:

import torch
from torch import nn
embedding = nn.Embedding(1000,128)
embedding(torch.LongTensor([3,4]))

will return the embedding vectors corresponding to words 3 and 4 in your vocabulary. As no model has been trained, they will be random.
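To illustrate that randomness (just a sketch; the indices 3 and 4 stand in for any two words), the cosine similarity between two freshly initialised vectors is an arbitrary number rather than a measure of word relatedness:

import torch
import torch.nn.functional as F
from torch import nn

embedding = nn.Embedding(1000, 128)
v3 = embedding(torch.LongTensor([3]))
v4 = embedding(torch.LongTensor([4]))
print(F.cosine_similarity(v3, v4))  # arbitrary value; changes on every re-initialisation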

Upvotes: 100
