rwallace

Reputation: 33515

Are PyTorch activation functions best stored as fields?

An example of a simple neural network in PyTorch can be found at https://visualstudiomagazine.com/articles/2020/10/14/pytorch-define-network.aspx

import torch as T

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(4, 8)  # 4-(8-8)-1
    self.hid2 = T.nn.Linear(8, 8)
    self.oupt = T.nn.Linear(8, 1)

    T.nn.init.xavier_uniform_(self.hid1.weight)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight)
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x)) 
    z = T.tanh(self.hid2(z))
    z = T.sigmoid(self.oupt(z))
    return z

A distinctive feature of the above is that the layers are stored as fields within the Net object (as they need to be, in the sense that they contain the weights, which need to be remembered across training epochs), but the activation functors such as tanh are re-created on every call to forward. The author says:

The most common structure for a binary classification network is to define the network layers and their associated weights and biases in the __init__() method, and the input-output computations in the forward() method.

Fair enough. On the other hand, perhaps it would be marginally faster to store the functors rather than re-create them on every call to forward. On the third hand, it's unlikely to make any measurable difference, which means it might end up being a matter of code style.
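To make the comparison concrete, here is roughly what I mean by storing them as fields (my own sketch, using the torch.nn.Tanh and torch.nn.Sigmoid modules instead of the function calls):

class Net2(T.nn.Module):
  def __init__(self):
    super(Net2, self).__init__()
    self.hid1 = T.nn.Linear(4, 8)  # 4-(8-8)-1
    self.hid2 = T.nn.Linear(8, 8)
    self.oupt = T.nn.Linear(8, 1)
    # activations stored as fields rather than called via T.tanh / T.sigmoid
    self.act = T.nn.Tanh()
    self.out_act = T.nn.Sigmoid()
    # (weight/bias initialization as above, omitted for brevity)

  def forward(self, x):
    z = self.act(self.hid1(x))
    z = self.act(self.hid2(z))
    return self.out_act(self.oupt(z))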

Is the above indeed the most common way to do it? Does either way have any technical advantage, or is it just a matter of style?

Upvotes: 0

Views: 458

Answers (1)

KonstantinosKokos

Reputation: 3473

On "storing" functors

The snippet is not "re-creating" anything -- calling torch.tanh(x) is literally just calling the function tanh exported by the torch package with argument x.
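To see this concretely (a quick sketch):

import torch

# torch.tanh is a plain function object exported by the torch package;
# calling it does not construct anything new
f = torch.tanh
print(f is torch.tanh)                   # True -- the same object every time

# torch.nn.Tanh, by contrast, is a Module class: instantiating it does
# create a (stateless) module object
act = torch.nn.Tanh()
print(isinstance(act, torch.nn.Module))  # True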


Other ways of doing it

I think the snippet is a fair example for small neural blocks that are use-and-forget or are just not meant to be parameterizable. Depending on your intentions, there are of course alternatives, but you'd have to weigh for yourself whether the added complexity offers any value.

  • activation functions as strings

allow a selection of an activation function from a fixed set

import torch
from torch import Tensor
from typing import Literal

class Model(torch.nn.Module):
    def __init__(..., activation_function: Literal['tanh'] | Literal['relu']):
        super().__init__()
        ...
        if activation_function == 'tanh':
            self.activation_function = torch.tanh
        elif activation_function == 'relu':
            self.activation_function = torch.relu
        else:
            raise ValueError(f'activation function {activation_function} not allowed, use tanh or relu.')

    def forward(...) -> Tensor:
        output = ...
        return self.activation_function(output)
  • activation functions as callables

use arbitrary modules or functions as activations

import torch
from torch import Tensor
from typing import Callable

class Model(torch.nn.Module):
    def __init__(..., activation_function: torch.nn.Module | Callable[[Tensor], Tensor]):
        super().__init__()  # required before assigning a Module as an attribute
        self.activation_function = activation_function

    def forward(...) -> Tensor:
        output = ...
        return self.activation_function(output)

which would for instance work like

def cube(x: Tensor) -> Tensor: return x**3

cubic_model = Model(..., activation_function=cube)
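or, since modules are callables too, something along the lines of (the choice of torch.nn.ReLU here is just an example):

relu_model = Model(..., activation_function=torch.nn.ReLU())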

The key difference between the above examples and your snippet is that the former are transparent and adjustable with respect to the activation used: you can inspect the activation function (i.e. model.activation_function) and change it (before or after initialization), whereas in the original snippet it is invisible and baked into the model's functionality (to replicate the model with a different activation, you'd have to define it from scratch).
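For instance, with the callable version above (reusing the hypothetical cubic_model):

# inspect which activation the model was built with
print(cubic_model.activation_function)   # the cube function defined above

# swap it out after initialization -- no need to redefine the class
cubic_model.activation_function = torch.relu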

Overall, I think the best way to go is to create small, locally tunable blocks that are as parametric as you need them to be, and wrap them into bigger blocks that generalize over the contained parameters. For example, if your big model consists of 5 linear layers, you could make a single, activation-parametric wrapper for one layer (including dropouts, layer norms, whatever), and then another wrapper for a flow of N layers, which asks once which activation function to initialize its children with. In other words, generalize and parameterize when you anticipate it will save you extra effort and copy-pasted code in the future, but don't overdo it or you'll end up far away from your original specifications and needs.
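A rough sketch of that layering idea (all names and sizes here are made up, not a prescription):

import torch
from torch import Tensor
from typing import Callable

class Block(torch.nn.Module):
    # one linear layer plus its activation; parametric over the activation
    def __init__(self, dim_in: int, dim_out: int, activation: Callable[[Tensor], Tensor]):
        super().__init__()
        self.linear = torch.nn.Linear(dim_in, dim_out)
        self.activation = activation

    def forward(self, x: Tensor) -> Tensor:
        return self.activation(self.linear(x))

class Stack(torch.nn.Module):
    # a flow of N blocks; asks once which activation to hand to its children
    def __init__(self, dims: list[int], activation: Callable[[Tensor], Tensor]):
        super().__init__()
        self.blocks = torch.nn.ModuleList(
            [Block(d_in, d_out, activation) for d_in, d_out in zip(dims, dims[1:])])

    def forward(self, x: Tensor) -> Tensor:
        for block in self.blocks:
            x = block(x)
        return x

# e.g. a 4-(8-8)-1 stack with tanh between layers
model = Stack([4, 8, 8, 1], activation=torch.tanh)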


ps: I don't know whether calling activation functions functors is justifiable.

Upvotes: 1
