I. A

Reputation: 2312

What is the equivalent of the Keras Masking() layer in TensorFlow? And do batch norm, conv, and ReLU support masking?

I am training a GRU layer where the inputs don't all have the same length. I have therefore padded the input features with 0.0 so that all sequences have the same length. However, I don't want to compute any loss at a time step, for a given sample, when the input feature vector at that step is all zeros. For example, at time step 1000 I have a batch size of 34, but samples 33 and 34 of this batch have no feature values at time step 1000.

I have found that we can use Masking()(inputs) in Keras as long as all subsequent layers or operations support masking. But I have implemented my model in TensorFlow. So what is the equivalent of Masking() in TensorFlow?

Second, how can I know whether batch normalization, a conv layer, and any nonlinear activation function support masking in Keras?

Your help is much appreciated!!

Upvotes: 3

Views: 913

Answers (1)

I. A

Reputation: 2312

I found a detailed solution on Danijar's blog: https://danijar.com/variable-sequence-lengths-in-tensorflow/.

Masking in Keras is used when sequences are incomplete. Usually you pad your sequences with 0.0 along the feature dimension (the third dimension when the input has shape [batch_size, sequence_length, num_features]). The Masking layer in Keras then takes a mask value (0.0 here), and every time step whose features all equal that value is skipped by the downstream layers that support masking.

In summary: he shows how to compute the sequence length of each sample in the batch with a length() function he implements. The resulting vector is then passed to dynamic_rnn (as sequence_length), which outputs zero vectors past the end of each sequence and copies the last valid state through, which is somewhat similar to what the Keras Masking() layer does. Second, we should apply a mask when computing the loss function.

All the details are discussed in this blog post.
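To make the two steps concrete, here is a minimal NumPy sketch of the idea (not the blog's TensorFlow code). The function names sequence_mask, sequence_lengths and masked_mse are made up for illustration, and the sketch assumes that a padded time step is one where every feature is exactly 0.0 and that padding only appears at the end of a sequence.

```python
import numpy as np

def sequence_mask(batch):
    """Return a [batch, time] mask: 1.0 where a step has any nonzero feature."""
    # Assumption: a time step is padding only if every feature is exactly 0.0.
    return (np.abs(batch).max(axis=2) > 0).astype(np.float64)

def sequence_lengths(batch):
    """Number of real (non-padding) time steps per sample."""
    return sequence_mask(batch).sum(axis=1).astype(np.int64)

def masked_mse(predictions, targets, mask):
    """Mean squared error averaged over the real time steps only."""
    per_step = ((predictions - targets) ** 2).mean(axis=2)  # shape [batch, time]
    return (per_step * mask).sum() / mask.sum()

# Two samples, 3 time steps, 2 features; the second sample has 1 real step.
batch = np.array([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                  [[7.0, 8.0], [0.0, 0.0], [0.0, 0.0]]])
mask = sequence_mask(batch)
lengths = sequence_lengths(batch)

targets = np.ones_like(batch)
predictions = np.zeros_like(batch)
loss = masked_mse(predictions, targets, mask)

# Changing predictions at padded steps must not change the loss.
noisy_predictions = predictions.copy()
noisy_predictions[1, 1:] = 100.0
noisy_loss = masked_mse(noisy_predictions, targets, mask)
```

Dividing by mask.sum() rather than by the total number of time steps keeps the padded steps from diluting the average.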

As for whether batch_norm, conv and nonlinear activation functions support masking: suppose the output layer uses a sigmoid activation. The derivative of its output with respect to its input is output * (1 - output), so wherever that output is 0 the derivative is zero as well. Since back propagation applies the chain rule, the gradients of that sample with respect to every weight in the network are then 0 too, and there is nothing to worry about. Note that this only holds where the sigmoid output itself is 0; a zero input to the sigmoid gives 0.5, not 0. The problem also arises when the activation is ReLU, for example. In that case the gradients should be explicitly multiplied by zeros before doing the back propagation (I guess). Maybe doing something like this will help:

final_output = output * mask

Then the derivative of final_output with respect to output is the mask itself, i.e. 0 or 1 (at any time step, for any sample). Back-propagating this gradient from the output of the activation function to its inputs and then applying the chain rule, the weights won't be affected by the masked time steps.
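A small NumPy sketch of that argument for ReLU, with the gradient written out by hand. The helper names relu and masked_relu_grad and the example values are made up for illustration; in a real TensorFlow model the automatic differentiation would produce this gradient for you.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def masked_relu_grad(pre_activation, mask, upstream_grad):
    """Gradient of sum(upstream_grad * relu(pre_activation) * mask)
    with respect to pre_activation, via the chain rule."""
    relu_grad = (pre_activation > 0).astype(np.float64)  # d relu / d input
    return upstream_grad * mask * relu_grad               # the mask zeroes padded steps

# [batch, time] pre-activations; the second sample has only 1 real step.
pre_activation = np.array([[0.5, 2.0, 1.5],
                           [1.0, 3.0, 0.7]])
mask = np.array([[1.0, 1.0, 1.0],
                 [1.0, 0.0, 0.0]])
upstream_grad = np.ones_like(pre_activation)

final_output = relu(pre_activation) * mask
grad = masked_relu_grad(pre_activation, mask, upstream_grad)
```

Even though the pre-activations at the padded steps are positive (so ReLU alone would pass a gradient of 1), the multiplication by the mask makes the gradient at those steps exactly 0.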

Upvotes: 2
