Bull

Reputation: 473

Pytorch: how to add L1 regularizer to activations?

I would like to add the L1 regularizer to the activations output from a ReLU. More generally, how does one add a regularizer only to a particular layer in the network?


Related material:

  • This similar post refers to adding L2 regularization, but it appears to add the regularization penalty to all layers of the network.

  • nn.modules.loss.L1Loss() seems relevant, but I do not yet understand how to use this.

  • The legacy module L1Penalty seems relevant also, but why has it been deprecated?

Upvotes: 37

Views: 60141

Answers (6)

bitspersecond

Reputation: 148

Some of the answers above are missing a vital piece of information:

  1. L1 or L2 regularization of a vector of parameters IS NOT THE SAME as the norm of that vector of the respective order.

  2. It is NOT WRONG to apply different regularization to different layers, or even to selected model parameters only; applying it consistently to all model parameters is merely a common practice. For example, what if I want the first layer of my model to be more interpretable than the second?

Given the above reasons,

  1. The statement in @rainy's answer:

l1_regularization += torch.norm(param, 1)**2

should be modified to:

l1_regularization += torch.norm(param, 1)

because the norm of order 1 involves no square root to begin with, so squaring it yields (sum |w|)^2, which is neither L1 nor L2 regularization. :P

  2. The statement in @sasank's answer:
l2_regularization = lambda2 * torch.norm(all_linear2_params, 2)

should be modified to:

l2_regularization = lambda2 * torch.norm(all_linear2_params, 2)**2

because the norm of order 2 is square-rooted, while L2 regularization should be the sum of squares without the square root (although using the square-rooted version would simply rescale the gradients).
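To make the distinction concrete, here is a minimal sanity check (the values are chosen arbitrarily) contrasting the two penalties with the plain norms:

import torch

w = torch.tensor([3.0, -4.0])

# L1 penalty: the sum of absolute values, identical to the order-1 norm (no square root).
assert torch.isclose(torch.norm(w, 1), w.abs().sum())        # both equal 7.0

# L2 penalty: the sum of squares, i.e. the *squared* order-2 norm.
assert torch.isclose(torch.norm(w, 2) ** 2, (w ** 2).sum())  # both equal 25.0

# Squaring the order-1 norm gives (sum |w|)^2 = 49.0, which is neither L1 nor L2.
print((torch.norm(w, 1) ** 2).item())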

Upvotes: 0

ndronen

Reputation: 1012

All of the (other current) responses are incorrect in some way, as the question is about adding regularization to the activations. This one is closest, in that it suggests summing the norms of the outputs, which is correct, but its code sums the norms of the weights, which is incorrect.

The correct way is not to modify the network code, but rather to capture the outputs via a forward hook, as in the OutputHook class. From there, the summing of the norms of the outputs is straightforward, but one needs to take care to clear the captured outputs every iteration.

import torch


class OutputHook(list):
    """ Hook to capture module outputs.
    """
    def __call__(self, module, input, output):
        self.append(output)


class MLP(torch.nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        self.linear1 = torch.nn.Linear(128, 32)
        self.linear2 = torch.nn.Linear(32, 16)
        self.linear3 = torch.nn.Linear(16, 2)
        # Instantiate ReLU, so a hook can be registered to capture its output.
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        layer1_out = self.relu(self.linear1(x))
        layer2_out = self.relu(self.linear2(layer1_out))
        out = self.linear3(layer2_out)
        return out


batch_size = 4
l1_lambda = 0.01

model = MLP()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
# Register hook to capture the ReLU outputs. Non-trivial networks will often
# require hooks to be applied more judiciously.
output_hook = OutputHook()
model.relu.register_forward_hook(output_hook)

inputs = torch.rand(batch_size, 128)
targets = torch.ones(batch_size).long()

optimizer.zero_grad()
outputs = model(inputs)
cross_entropy_loss = torch.nn.functional.cross_entropy(outputs, targets)

# Compute the L1 penalty over the ReLU outputs captured by the hook.
l1_penalty = 0.
for output in output_hook:
    l1_penalty += torch.norm(output, 1)
l1_penalty *= l1_lambda

loss = cross_entropy_loss + l1_penalty
loss.backward()
optimizer.step()
output_hook.clear()
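One practical note, not part of the original answer: register_forward_hook returns a handle, so the hook can be detached once the penalty is no longer needed, e.g. before evaluation:

handle = model.relu.register_forward_hook(output_hook)
# ... training loop as above ...
handle.remove()  # stop capturing ReLU outputs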

Upvotes: 22

iacob

Reputation: 24371

You can add an L1 penalty on the weights of a single layer of your model, my_layer, to the loss function with the following code:

def l1_penalty(params, l1_lambda=0.001):
    """Returns the L1 penalty of the params."""
    l1_norm = sum(p.abs().sum() for p in params)
    return l1_lambda*l1_norm

loss = loss_fn(outputs, labels) + l1_penalty(my_layer.parameters())
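The same helper also covers the original question about activations: pass the captured layer outputs instead of the parameters. Here layer1_out stands for a hypothetical ReLU output captured in forward or via a hook:

loss = loss_fn(outputs, labels) + l1_penalty([layer1_out])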

Upvotes: 1

rainy

Reputation: 111

@Sasank Chilamkurthy Regularization should be applied to the weight parameters of each layer of the model, not to the output of each layer. Please see below:

import torch
from torch.autograd import Variable
from torch.nn import functional as F


class MLP(torch.nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        self.linear1 = torch.nn.Linear(128, 32)
        self.linear2 = torch.nn.Linear(32, 16)
        self.linear3 = torch.nn.Linear(16, 2)
    def forward(self, x):
        layer1_out = F.relu(self.linear1(x))
        layer2_out = F.relu(self.linear2(layer1_out))
        out = self.linear3(layer2_out)
        return out

batchsize = 4
lambda1, lambda2 = 0.5, 0.01

model = MLP()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

inputs = Variable(torch.rand(batchsize, 128))
targets = Variable(torch.ones(batchsize).long())
# Use float tensors: integer accumulators cannot take in-place float additions.
l1_regularization, l2_regularization = torch.tensor(0.), torch.tensor(0.)

optimizer.zero_grad()
outputs = model(inputs)
cross_entropy_loss = F.cross_entropy(outputs, targets)
for param in model.parameters():
    l1_regularization += torch.norm(param, 1)**2
    l2_regularization += torch.norm(param, 2)**2

loss = cross_entropy_loss + l1_regularization + l2_regularization
loss.backward()
optimizer.step()

Upvotes: 7

Tethys

Reputation: 31

I think the original post wants to regularize the output of the ReLU, so the regularizer should be applied to the output, not to the weights of the network. They are not the same!

  • Regularizing the weights with the L1 norm trains a neural network with sparse weights.

  • Regularizing the output of a layer with the L1 norm trains a network with a sparse output at that particular layer.

Either the answers above (including the accepted one) missed the point, or I am misunderstanding the original question.
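A minimal sketch of the distinction (the layer and input here are illustrative, not from any answer above):

import torch

linear = torch.nn.Linear(8, 4)
x = torch.rand(2, 8)
activations = torch.relu(linear(x))

weight_penalty = linear.weight.abs().sum()  # L1 on weights: encourages sparse weights
output_penalty = activations.abs().sum()    # L1 on the output: encourages sparse activations, which is what the question asks for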

Upvotes: 3

Sasank Chilamkurthy

Reputation: 1080

Here is how you do this:

  • In your Module's forward, return the final output along with the outputs of the layers for which you want to apply L1 regularization.
  • The loss variable will be the sum of the cross-entropy loss of the output w.r.t. the targets and the L1 penalties.

Here's some example code:

import torch
from torch.autograd import Variable
from torch.nn import functional as F


class MLP(torch.nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        self.linear1 = torch.nn.Linear(128, 32)
        self.linear2 = torch.nn.Linear(32, 16)
        self.linear3 = torch.nn.Linear(16, 2)

    def forward(self, x):
        layer1_out = F.relu(self.linear1(x))
        layer2_out = F.relu(self.linear2(layer1_out))
        out = self.linear3(layer2_out)
        return out, layer1_out, layer2_out

batchsize = 4
lambda1, lambda2 = 0.5, 0.01

model = MLP()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

# usually the following code is looped over all batches
# but let's just do a dummy batch for brevity

inputs = Variable(torch.rand(batchsize, 128))
targets = Variable(torch.ones(batchsize).long())

optimizer.zero_grad()
outputs, layer1_out, layer2_out = model(inputs)
cross_entropy_loss = F.cross_entropy(outputs, targets)

all_linear1_params = torch.cat([x.view(-1) for x in model.linear1.parameters()])
all_linear2_params = torch.cat([x.view(-1) for x in model.linear2.parameters()])
l1_regularization = lambda1 * torch.norm(all_linear1_params, 1)
l2_regularization = lambda2 * torch.norm(all_linear2_params, 2)

loss = cross_entropy_loss + l1_regularization + l2_regularization
loss.backward()
optimizer.step()
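Note that layer1_out and layer2_out are returned but unused in the penalty above; to penalize the activations themselves, as the question asks, the same pattern applies to them (a sketch, not part of the original code):

l1_activation_penalty = lambda1 * layer1_out.abs().sum()
loss = cross_entropy_loss + l1_activation_penalty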

Upvotes: 43
