I have implemented a fairly simple deep neural network to perform multi-label classification. The model (biases omitted for the sake of a simple visualization) is a 3-layer deep neural network with ReLU units and a sigmoid output layer.
The loss function is sigmoid cross-entropy and the optimizer is Adam.
When I train this NN without Dropout (code below), I get the following results:
import tensorflow as tf

#Placeholders
x = tf.placeholder(tf.float32,[None,num_features],name='x')
y = tf.placeholder(tf.float32,[None,num_classes],name='y')
keep_prob = tf.placeholder(tf.float32,name='keep_prob') #not used in this version (no Dropout)
#Layer1
WRelu1 = tf.Variable(tf.truncated_normal([num_features,num_features],stddev=1.0),dtype=tf.float32,name='wrelu1')
bRelu1 = tf.Variable(tf.zeros([num_features]),dtype=tf.float32,name='brelu1')
layer1 = tf.add(tf.matmul(x,WRelu1),bRelu1,name='layer1')
relu1 = tf.nn.relu(layer1,name='relu1')
#Layer2
WRelu2 = tf.Variable(tf.truncated_normal([num_features,num_features],stddev=1.0),dtype=tf.float32,name='wrelu2')
bRelu2 = tf.Variable(tf.zeros([num_features]),dtype=tf.float32,name='brelu2')
layer2 = tf.add(tf.matmul(relu1,WRelu2),bRelu2,name='layer2')
relu2 = tf.nn.relu(layer2,name='relu2')
#Layer3
WRelu3 = tf.Variable(tf.truncated_normal([num_features,num_features],stddev=1.0),dtype=tf.float32,name='wrelu3')
bRelu3 = tf.Variable(tf.zeros([num_features]),dtype=tf.float32,name='brelu3')
layer3 = tf.add(tf.matmul(relu2,WRelu3),bRelu3,name='layer3')
relu3 = tf.nn.relu(layer3,name='relu3')
#Out layer
Wout = tf.Variable(tf.truncated_normal([num_features,num_classes],stddev=1.0),dtype=tf.float32,name='wout')
bout = tf.Variable(tf.zeros([num_classes]),dtype=tf.float32,name='bout')
logits = tf.add(tf.matmul(relu3,Wout),bout,name='logits')
#Predictions
logits_sigmoid = tf.nn.sigmoid(logits,name='logits_sigmoid')
#Cost & Optimizer
cost = tf.losses.sigmoid_cross_entropy(y,logits)
optimizer = tf.train.AdamOptimizer(LEARNING_RATE).minimize(cost)
Evaluation results on test data:
ROC AUC - micro average: 0.6474180196222774
ROC AUC - macro average: 0.6261438437099212
Precision - micro average: 0.5112489722699753
Precision - macro average: 0.48922193879411413
Precision - weighted average: 0.5131092162035961
Recall - micro average: 0.584640369246549
Recall - macro average: 0.55746897003228
Recall - weighted average: 0.584640369246549
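The metrics above are computed roughly like this (a minimal sketch assuming scikit-learn, a 0.5 decision threshold and placeholder names probs, preds, y_test, sess, x_test; none of this is in the code above):
#Sketch only: scikit-learn metrics on the sigmoid outputs of the test set
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score, recall_score

probs = sess.run(logits_sigmoid, feed_dict={x: x_test})   #shape [n_samples, num_classes]
preds = (probs >= 0.5).astype(int)                        #hard 0/1 prediction per class
#y_test is the binary label indicator matrix of the test set
print('ROC AUC - micro average:', roc_auc_score(y_test, probs, average='micro'))
print('ROC AUC - macro average:', roc_auc_score(y_test, probs, average='macro'))
print('Precision - micro average:', precision_score(y_test, preds, average='micro'))
print('Recall - micro average:', recall_score(y_test, preds, average='micro'))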
When I train this NN with Dropout layers added (code below), I get the following results:
#Placeholders
x = tf.placeholder(tf.float32,[None,num_features],name='x')
y = tf.placeholder(tf.float32,[None,num_classes],name='y')
keep_prob = tf.placeholder(tf.float32,name='keep_prob')
#Layer1
WRelu1 = tf.Variable(tf.truncated_normal([num_features,num_features],stddev=1.0),dtype=tf.float32,name='wrelu1')
bRelu1 = tf.Variable(tf.zeros([num_features]),dtype=tf.float32,name='brelu1')
layer1 = tf.add(tf.matmul(x,WRelu1),bRelu1,name='layer1')
relu1 = tf.nn.relu(layer1,name='relu1')
#DROPOUT
relu1 = tf.nn.dropout(relu1,keep_prob=keep_prob,name='relu1drop')
#Layer2
WRelu2 = tf.Variable(tf.truncated_normal([num_features,num_features],stddev=1.0),dtype=tf.float32,name='wrelu2')
bRelu2 = tf.Variable(tf.zeros([num_features]),dtype=tf.float32,name='brelu2')
layer2 = tf.add(tf.matmul(relu1,WRelu2),bRelu2,name='layer2')
relu2 = tf.nn.relu(layer2,name='relu2')
#DROPOUT
relu2 = tf.nn.dropout(relu2,keep_prob=keep_prob,name='relu2drop')
#Layer3
WRelu3 = tf.Variable(tf.truncated_normal([num_features,num_features],stddev=1.0),dtype=tf.float32,name='wrelu3')
bRelu3 = tf.Variable(tf.zeros([num_features]),dtype=tf.float32,name='brelu3')
layer3 = tf.add(tf.matmul(relu2,WRelu3),bRelu3,name='layer3')
relu3 = tf.nn.relu(layer3,name='relu3')
#DROPOUT
relu3 = tf.nn.dropout(relu3,keep_prob=keep_prob,name='relu3drop')
#Out layer
Wout = tf.Variable(tf.truncated_normal([num_features,num_classes],stddev=1.0),dtype=tf.float32,name='wout')
bout = tf.Variable(tf.zeros([num_classes]),dtype=tf.float32,name='bout')
logits = tf.add(tf.matmul(relu3,Wout),bout,name='logits')
#Predictions
logits_sigmoid = tf.nn.sigmoid(logits,name='logits_sigmoid')
#Cost & Optimizer
cost = tf.losses.sigmoid_cross_entropy(y,logits)
optimizer = tf.train.AdamOptimizer(LEARNING_RATE).minimize(cost)
Evaluation results on test data:
ROC AUC - micro average: 0.5
ROC AUC - macro average: 0.5
Precision - micro average: 0.34146163499985405
Precision - macro average: 0.34146163499985405
Precision - weighted average: 0.3712475781926326
Recall - micro average: 1.0
Recall - macro average: 1.0
Recall - weighted average: 1.0
As you can see from the recall values in the Dropout version, the NN output is always 1, i.e. the positive class is predicted for every class of every sample.
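That is easy to check in isolation: with constant all-positive predictions, recall is 1.0 by construction and ROC AUC drops to 0.5, since constant scores carry no ranking information. A tiny toy example (made-up data, assuming scikit-learn):
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

y_true = np.array([[1,0],[0,1],[1,1],[0,0]])   #4 toy samples, 2 labels
y_pred = np.ones_like(y_true)                  #always predict the positive class
scores = np.full(y_true.shape, 0.9)            #constant scores, no ranking at all

print(recall_score(y_true, y_pred, average='micro'))   #1.0: every true positive is "found"
print(roc_auc_score(y_true, scores, average='micro'))  #0.5: constant scores cannot discriminate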
It's true that it's not an easy problem, but after applying Dropout I expected results at least similar to those without Dropout, not worse results and certainly not this saturated output.
Why could this be happening? How can I avoid this behaviour? Do you see anything strange or wrong in the code?
Hyperparameters:
Keep probability (keep_prob): 0.5 @ training / 1.0 @ inference (see the sketch below)
Epochs: 500
Learning rate: 0.0001
Dataset information:
Number of instances: 22,000+
Number of classes: 6
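For completeness, the keep_prob values above are fed through the feed_dict roughly like this (a sketch, not the exact training loop; EPOCHS, get_batches, sess and x_test are placeholder names):
#Sketch of how keep_prob is fed; batching details are omitted/assumed
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(EPOCHS):
        for batch_x, batch_y in get_batches():              #assumed batching helper
            sess.run(optimizer, feed_dict={x: batch_x,
                                           y: batch_y,
                                           keep_prob: 0.5}) #drop half the units while training
    #No units are dropped at inference time
    test_probs = sess.run(logits_sigmoid, feed_dict={x: x_test,
                                                     keep_prob: 1.0})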
Thanks!
Finally I've managed to solve my own question with some more experimentation, so this is what I figured out.
I exported the TensorBoard graph together with the weight, bias and activation data in order to explore them in TensorBoard.
Then I realized that something wasn't right with the weights.
Looking at the weight histograms, the weights weren't changing at all; in other words, the layers "weren't learning" anything.
But then the explanation was right in front of my eyes: the distribution of the weights was too broad. The histogram range was roughly [-2, 2], which is far too wide.
Then I realized that I was initializing the weight matrices with
truncated_normal(mean=0.0, stddev=1.0)
which is a really high standard deviation for a proper initialization. The obvious fix was to initialize the weights more sensibly, so I chose Xavier/Glorot initialization, and the weight distributions became much narrower.
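For reference, one way to do this in the same TF 1.x style (a sketch, shown for the first layer only; only the initializer changes with respect to the original code):
#Glorot/Xavier initialization instead of truncated_normal(stddev=1.0)
init = tf.glorot_uniform_initializer()
WRelu1 = tf.Variable(init([num_features,num_features]),dtype=tf.float32,name='wrelu1')

#Roughly equivalent by hand (Glorot-normal variant): stddev = sqrt(2 / (fan_in + fan_out))
stddev = (2.0/(num_features+num_features))**0.5
WRelu2 = tf.Variable(tf.truncated_normal([num_features,num_features],stddev=stddev),
                     dtype=tf.float32,name='wrelu2')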
The predictions stopped being all positive and became mixed predictions again, and of course the performance on the test set improved thanks to Dropout.
In summary, the net without Dropout was able to learn something despite that overly broad initialization, but the net with Dropout wasn't, and it needed a better initialization in order not to get stuck.
Thanks to everyone who read the post and contributed with a comment.