Reputation: 1122
From the TensorFlow documentation, I see there are a few ways of applying L1 regularisation. The first is the most intuitive to me. This example behaves as expected: d1 holds sixteen 3's, which sum to 48, and scaled by 0.1 we get 4.8 as the loss.
import tensorflow as tf

d1 = tf.ones(shape=(2, 2, 4)) * 3          # 16 entries, all equal to 3
regularizer = tf.keras.regularizers.l1(0.1)
regularizer(d1)
<tf.Tensor: shape=(), dtype=float32, numpy=4.8>
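This matches the penalty computed by hand (continuing from the snippet above; tf.reduce_sum and tf.abs just spell out what the L1 regularizer does internally):
0.1 * tf.reduce_sum(tf.abs(d1))  # 4.8, the same value the regularizer returns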
In the second way, the regularisation is applied to the kernel, so I'm guessing it encourages sparsity of the model weights. I can't tell exactly how the loss of 0.54747146 comes about.
layer = tf.keras.layers.Dense(3, input_dim=(2, 2, 4), kernel_regularizer=tf.keras.regularizers.l1(0.1))
out = layer(d1)
layer.losses
[<tf.Tensor: shape=(), dtype=float32, numpy=0.54747146>]
I believed the third way should give the same result as the first way, where the regularizer is applied directly to the tensor. Here we use activity_regularizer, which the documentation describes as a "Regularizer to apply a penalty on the layer's output."
layer2 = tf.keras.layers.Dense(3, input_dim=(2, 2, 4), activity_regularizer=tf.keras.regularizers.l1(0.1))
out2 = layer2(d1)
layer2.losses
[<tf.Tensor: shape=(), dtype=float32, numpy=1.4821562>]
The documentation also notes: "The value returned by the activity_regularizer is divided by the input batch size..."
Why is the loss 1.4821562? It comes out different every time I rerun it. How do the third and first ways differ?
If I want to encourage sparsity of d1, which should I use?
Upvotes: 1
Views: 458
Reputation: 91
What your dense layer is calculating is the matrix product y = Wx + b. Your three different ways of applying L1 calculate:
l1(x)
l1(W)
l1(Wx + b)
Since the weights and biases are randomly generated, they will be different for each run unless you specify a fixed seed.
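Here is a minimal sketch to verify this (assuming TF 2.x in eager mode; the layer and variable names are just illustrative). With a fixed seed the numbers become reproducible, and each loss Keras reports can be recomputed by hand:

import tensorflow as tf

tf.random.set_seed(0)  # fix the global seed so the random kernel is reproducible across runs

d1 = tf.ones(shape=(2, 2, 4)) * 3
reg = tf.keras.regularizers.l1(0.1)

# way 1: penalty on the tensor itself, 0.1 * sum(|x|) = 0.1 * 16 * 3 = 4.8
print(reg(d1))

# way 2: penalty on the kernel W only (the bias is not regularized here)
layer = tf.keras.layers.Dense(3, kernel_regularizer=reg)
_ = layer(d1)  # builds the layer; W has shape (4, 3)
print(layer.losses[0])                            # value reported by Keras
print(0.1 * tf.reduce_sum(tf.abs(layer.kernel)))  # same value, computed by hand

# way 3: penalty on the output Wx + b, divided by the batch size (2 here)
layer2 = tf.keras.layers.Dense(3, activity_regularizer=reg)
out2 = layer2(d1)
print(layer2.losses[0])                                 # value reported by Keras
print(0.1 * tf.reduce_sum(tf.abs(out2)) / d1.shape[0])  # same value, computed by hand

Only l1(W) and l1(Wx + b) depend on the random initialisation, which is why the first value (4.8) is stable while the other two change from run to run.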
Upvotes: 1