Moran Reznik

Reputation: 1371

PyTorch transformer argument "dim_feedforward"

I would like to understand what exactly is going on with this argument.

  1. I have read that the feed-forward sub-layer inside the transformer layer is a "pointwise" feed-forward layer. What does "pointwise" mean in this context?

  2. A feed-forward layer takes 2 arguments: input features and output features. This argument can't be the output features, since no matter what value I use for it, the output of the transformer layer always has the same shape. It also can't be the input features, since that is determined by the self-attention sub-layer.

  3. MOST IMPORTANTLY - where is the argument for the size of the attention tensors, the ones that translate the input into queries, keys and values?

Upvotes: 2

Views: 3870

Answers (1)

Li Jiangxin

Reputation: 66

  1. "Position-wise", or "Point-wise", means the feed forward network (FFN) takes each position of a sequence, say, each word of a sentence, as its input. So point-wise FFN is a shared FFN that inputs each word one by one.
  2. (and 3.) That's right. It is neither the input features (determined by the self-attention sub-layer) nor the output features (which must equal the input features). It is the hidden features. This particular FFN in the transformer encoder has two linear layers, according to the implementation of TransformerEncoderLayer :
    # Implementation of Feedforward model
    self.linear1 = Linear(d_model, dim_feedforward, **factory_kwargs)  # expand: d_model -> dim_feedforward
    self.dropout = Dropout(dropout)
    self.linear2 = Linear(dim_feedforward, d_model, **factory_kwargs)  # project back: dim_feedforward -> d_model

So dim_feedforward is the number of features in the hidden layer of the FFN. Its value is usually set several times larger than d_model (the PyTorch default is 2048, with d_model defaulting to 512). As for question 3: there is no separate argument for the attention tensors - in PyTorch's nn.MultiheadAttention the query, key and value projections all map d_model to d_model (split across nhead heads), so their size is fixed by d_model and nhead.
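To make both points concrete, here is a minimal runnable sketch. The tensor sizes and nhead=8 are made-up illustration values, not taken from the question; the manual ffn below mirrors the linear1/linear2 structure quoted above rather than being the exact library code:

    import torch
    import torch.nn as nn

    d_model, dim_feedforward = 512, 2048
    seq_len, batch_size = 10, 32
    x = torch.rand(seq_len, batch_size, d_model)

    # "Position-wise": the linear layers act on the last dimension only,
    # so the same weights are applied to every position independently.
    ffn = nn.Sequential(
        nn.Linear(d_model, dim_feedforward),
        nn.ReLU(),
        nn.Linear(dim_feedforward, d_model),
    )
    print(ffn(x).shape)  # torch.Size([10, 32, 512]) -- same shape as the input

    # Changing dim_feedforward changes only the hidden width of the FFN,
    # never the output shape of the transformer layer:
    for dff in (256, 2048, 8192):
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           dim_feedforward=dff)
        print(layer(x).shape)  # torch.Size([10, 32, 512]) every time

This also illustrates why the questioner saw the same output shape regardless of the value passed: dim_feedforward only sizes the hidden expansion inside the FFN sub-layer.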

Upvotes: 5
