Foobar

Reputation: 8477

Why are embeddings necessary in Pytorch?

I (think) I understand the basic principle behind embeddings: they're a shortcut to quickly perform the operation (one-hot encoded vector) * matrix without actually performing that operation (by exploiting the fact that this operation is equivalent to indexing into the matrix), while still producing the same gradient as if that operation had actually been performed.
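
For example, here is a minimal sketch of the equivalence I mean (written against plain torch.nn.Embedding; the variable names are just for illustration):

import torch
from torch import nn
import torch.nn.functional as F

emb = nn.Embedding(3, 2)                 # a 3 x 2 weight matrix
idx = torch.tensor([0, 2])

# multiplying one-hot rows by the weight matrix...
one_hot = F.one_hot(idx, num_classes=3).float()
via_matmul = one_hot @ emb.weight

# ...picks out the same rows as the embedding lookup (i.e. indexing)
via_lookup = emb(idx)
print(torch.allclose(via_matmul, via_lookup))  # True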

I know in the following example:

import torch
from torch.nn import Embedding  # fastai's Embedding in the fastbook wraps this
e = Embedding(3, 2)
n = e(torch.LongTensor([0, 2]))

n will be a tensor of shape (2, 2).

But, we could also do:

from torch import nn  # needed for nn.Parameter
p = nn.Parameter(torch.zeros([3, 2]).normal_(0, 0.01))
p[torch.tensor([0, 2])]

and get the same result without an embedding.

This in and of itself wouldn't be confusing, since in the first example n has a grad_fn called EmbeddingBackward, whereas in the second example the result of indexing p has a grad_fn called IndexBackward, which is what we expect since we know embeddings simulate a different derivative.
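
A quick sketch that just prints the grad_fn of each result (the exact class names, e.g. EmbeddingBackward0 vs EmbeddingBackward, vary slightly between PyTorch versions):

import torch
from torch import nn

e = nn.Embedding(3, 2)
print(e(torch.LongTensor([0, 2])).grad_fn)   # <EmbeddingBackward0 ...>

p = nn.Parameter(torch.zeros(3, 2).normal_(0, 0.01))
print(p[torch.tensor([0, 2])].grad_fn)       # <IndexBackward0 ...>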

The confusing part is in chapter 8 of the fastbook they use embeddings to compute movie recommendations. But then, they do it without embeddings in basically the same manner & the model still works. I would expect the version without embeddings to fail because the derivative would be incorrect.

Version with embeddings:

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)

Version without:

def create_params(size):
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors[x[:,0]]
        movies = self.movie_factors[x[:,1]]
        res = (users*movies).sum(dim=1)
        res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)

Does anyone know what is going on?

Upvotes: 2

Views: 593

Answers (1)

Szymon Maszke

Reputation: 24726

Besides the points provided in the comments:

I (think) I understand the basic principle behind embeddings: they're a shortcut to quickly perform the operation (one-hot encoded vector) * matrix without actually performing that operation

This one is only partially correct. Indexing into a matrix is a fully differentiable operation; it just takes part of the data and propagates the gradient along that path only. The multiplication here would be wasteful and unnecessary.
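
A minimal sketch of that point (sizes and names here are arbitrary): only the rows that were indexed receive any gradient, the rest stay zero.

import torch
from torch import nn

w = nn.Parameter(torch.zeros(4, 3).normal_(0, 0.01))
w[torch.tensor([1, 3])].sum().backward()

print(w.grad)
# rows 1 and 3 are filled with ones, rows 0 and 2 remain zero:
# gradient propagates only along the indexing path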

and get the same result without an embedding.

This one is true.

called EmbeddingBackward whereas in the second example the result of indexing p has a grad_fn called IndexBackward, which is what we expect since we know embeddings simulate a different derivative. (emphasis mine)

This one isn't (or at least often isn't). Embedding also just selects part of the data, exactly like you did; the grad_fn is different because of the extra functionality, but in principle it is the same thing (choosing some vector(s) from the matrix). Think of it as IndexBackward on steroids.
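
You can check this yourself with a small sketch (arbitrary sizes, names are mine): give an nn.Embedding and a plain nn.Parameter the same weights, backpropagate through both, and the gradients come out identical.

import torch
from torch import nn

idx = torch.tensor([0, 2, 2])

emb = nn.Embedding(3, 2)
par = nn.Parameter(emb.weight.detach().clone())  # same weights, no Embedding

emb(idx).sum().backward()   # EmbeddingBackward
par[idx].sum().backward()   # IndexBackward

print(torch.allclose(emb.weight.grad, par.grad))  # True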

The confusing part is in chapter 8 of the fastbook they use embeddings to compute movie recommendations. But then, they do it without embeddings in basically the same manner & the model still works. I would expect the version without embeddings to fail because the derivative would be incorrect. (emphasis mine)

In this exact case, both approaches are equivalent (give or take differences due to random initialization).

Why nn.Embedding at all?

  • It makes your intent clearer to anyone reading the code
  • Common utilities and extra functionality are available if needed (as pointed out in the comments); see the sketch below
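
For example, two of those extras that a raw nn.Parameter does not give you out of the box (a minimal sketch, the values are arbitrary):

import torch
from torch import nn

# padding_idx: that row stays zero and never receives a gradient update
emb = nn.Embedding(5, 3, padding_idx=0)
print(emb(torch.tensor([0, 2]))[0])        # all zeros

# from_pretrained: load existing vectors, optionally frozen
frozen = nn.Embedding.from_pretrained(torch.randn(5, 3), freeze=True)
print(frozen.weight.requires_grad)         # False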

Upvotes: 1
