Reputation: 55
This code is from PyTorch transformer:
self.linear1 = Linear(d_model, dim_feedforward, **factory_kwargs)
self.dropout = Dropout(dropout)
self.linear2 = Linear(dim_feedforward, d_model, **factory_kwargs)
self.norm1 = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
self.norm2 = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
self.norm3 = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
self.dropout1 = Dropout(dropout)
self.dropout2 = Dropout(dropout)
self.dropout3 = Dropout(dropout)
Why do they add self.dropout1, self.dropout2, and self.dropout3 when self.dropout already exists and is the exact same function? Also, what is the difference between (self.linear1, self.linear2) and self.linear?
Upvotes: 5
Views: 2172
Reputation: 8527
In the case of Dropout, reusing the layer should not usually be an issue. So you could create a single self.dropout = Dropout(dropout) layer and call it multiple times in the forward function. But there may be subtle use cases which would behave differently when you do this, such as if you iterate across the layers of a network for some reason. This thread, and particularly this post, discuss this in some detail.
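A minimal sketch of why the reuse is usually harmless (the toy module and drop rate below are made up for illustration, not taken from the transformer code): Dropout holds no learnable parameters, so calling one instance several times behaves the same as calling several instances with the same rate.
import torch.nn as nn

# Toy example: one Dropout instance reused twice in forward().
class SharedDropoutBlock(nn.Module):
    def __init__(self, p=0.1):
        super().__init__()
        self.dropout = nn.Dropout(p)   # stores only the drop rate, no weights

    def forward(self, x):
        x = self.dropout(x)            # first use
        return self.dropout(x)         # second use of the same instance

block = SharedDropoutBlock()
print(sum(p.numel() for p in block.parameters()))  # 0 -- nothing to share or optimize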
For the linear layers, each Linear object is characterized by its own set of weights and biases. If you call the same instance multiple times in the forward function, all the calls will share and optimize the same set of weights. This can have legitimate uses, but it is not appropriate when you want multiple linear layers, each with its own set of weights and biases.
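As a rough sketch of that difference (the layer sizes and helper functions below are invented for illustration): calling one Linear twice routes both calls through the same weight matrix and bias, while two separate Linear instances each train their own parameters.
import torch
import torch.nn as nn

shared = nn.Linear(512, 512)   # one set of weights
first = nn.Linear(512, 512)    # its own weights
second = nn.Linear(512, 512)   # another independent set of weights

def forward_shared(x):
    # Both calls use, and backpropagate into, the same weights and bias.
    return shared(shared(x))

def forward_separate(x):
    # Each layer keeps its own weights and bias.
    return second(first(x))

x = torch.randn(4, 512)
forward_shared(x)
forward_separate(x)

# The shared version has half as many parameters.
print(sum(p.numel() for p in shared.parameters()))      # 262656
print(sum(p.numel() for p in first.parameters()) +
      sum(p.numel() for p in second.parameters()))      # 525312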
Upvotes: 3
Reputation: 2357
That is done to keep each Linear or Dropout layer separate from the others. The logic is simple: each assignment such as self.dropout = Dropout(dropout) creates a distinct instance, i.e. a separate layer, of Dropout in the network.
Upvotes: 0