I have implemented a fairly simple deep neural network to perform multi-label classification. The model (biases omitted for the sake of a simple visualization) is a 3-layer deep neural network with ReLU units and a sigmoid output layer.
The loss function is sigmoid cross-entropy and the optimizer is Adam.
When I train this NN without Dropout (code below), I get the following results:
import tensorflow as tf

#Placeholders
x = tf.placeholder(tf.float32,[None,num_features],name='x')
y = tf.placeholder(tf.float32,[None,num_classes],name='y')
keep_prob = tf.placeholder(tf.float32,name='keep_prob') #not used in this version (no Dropout)
#Layer1
WRelu1 = tf.Variable(tf.truncated_normal([num_features,num_features],stddev=1.0),dtype=tf.float32,name='wrelu1')
bRelu1 = tf.Variable(tf.zeros([num_features]),dtype=tf.float32,name='brelu1')
layer1 = tf.add(tf.matmul(x,WRelu1),bRelu1,name='layer1')
relu1 = tf.nn.relu(layer1,name='relu1')
#Layer2
WRelu2 = tf.Variable(tf.truncated_normal([num_features,num_features],stddev=1.0),dtype=tf.float32,name='wrelu2')
bRelu2 = tf.Variable(tf.zeros([num_features]),dtype=tf.float32,name='brelu2')
layer2 = tf.add(tf.matmul(relu1,WRelu2),bRelu2,name='layer2')
relu2 = tf.nn.relu(layer2,name='relu2')
#Layer3
WRelu3 = tf.Variable(tf.truncated_normal([num_features,num_features],stddev=1.0),dtype=tf.float32,name='wrelu3')
bRelu3 = tf.Variable(tf.zeros([num_features]),dtype=tf.float32,name='brelu3')
layer3 = tf.add(tf.matmul(relu2,WRelu3),bRelu3,name='layer3')
relu3 = tf.nn.relu(layer3,name='relu3')
#Out layer
Wout = tf.Variable(tf.truncated_normal([num_features,num_classes],stddev=1.0),dtype=tf.float32,name='wout')
bout = tf.Variable(tf.zeros([num_classes]),dtype=tf.float32,name='bout')
logits = tf.add(tf.matmul(relu3,Wout),bout,name='logits')
#Predictions
logits_sigmoid = tf.nn.sigmoid(logits,name='logits_sigmoid')
#Cost & Optimizer
cost = tf.losses.sigmoid_cross_entropy(y,logits)
optimizer = tf.train.AdamOptimizer(LEARNING_RATE).minimize(cost)
Evaluation results on test data:
ROC AUC - micro average: 0.6474180196222774
ROC AUC - macro average: 0.6261438437099212
Precision - micro average: 0.5112489722699753
Precision - macro average: 0.48922193879411413
Precision - weighted average: 0.5131092162035961
Recall - micro average: 0.584640369246549
Recall - macro average: 0.55746897003228
Recall - weighted average: 0.584640369246549
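The metrics above are computed roughly like this (a minimal sketch assuming scikit-learn, a 0.5 decision threshold and placeholder names probs, preds, y_test, sess, x_test; none of this is in the code above):
#Sketch only: scikit-learn metrics on the sigmoid outputs of the test set
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score, recall_score

probs = sess.run(logits_sigmoid, feed_dict={x: x_test})   #shape [n_samples, num_classes]
preds = (probs >= 0.5).astype(int)                        #hard 0/1 prediction per class
#y_test is the binary label indicator matrix of the test set
print('ROC AUC - micro average:', roc_auc_score(y_test, probs, average='micro'))
print('ROC AUC - macro average:', roc_auc_score(y_test, probs, average='macro'))
print('Precision - micro average:', precision_score(y_test, preds, average='micro'))
print('Recall - micro average:', recall_score(y_test, preds, average='micro'))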
When I train this NN with Dropout layers added (code below), I get the following results:
#Placeholders
x = tf.placeholder(tf.float32,[None,num_features],name='x')
y = tf.placeholder(tf.float32,[None,num_classes],name='y')
keep_prob = tf.placeholder(tf.float32,name='keep_prob')
#Layer1
WRelu1 = tf.Variable(tf.truncated_normal([num_features,num_features],stddev=1.0),dtype=tf.float32,name='wrelu1')
bRelu1 = tf.Variable(tf.zeros([num_features]),dtype=tf.float32,name='brelu1')
layer1 = tf.add(tf.matmul(x,WRelu1),bRelu1,name='layer1')
relu1 = tf.nn.relu(layer1,name='relu1')
#DROPOUT
relu1 = tf.nn.dropout(relu1,keep_prob=keep_prob,name='relu1drop')
#Layer2
WRelu2 = tf.Variable(tf.truncated_normal([num_features,num_features],stddev=1.0),dtype=tf.float32,name='wrelu2')
bRelu2 = tf.Variable(tf.zeros([num_features]),dtype=tf.float32,name='brelu2')
layer2 = tf.add(tf.matmul(relu1,WRelu2),bRelu2,name='layer2')
relu2 = tf.nn.relu(layer2,name='relu2')
#DROPOUT
relu2 = tf.nn.dropout(relu2,keep_prob=keep_prob,name='relu2drop')
#Layer3
WRelu3 = tf.Variable(tf.truncated_normal([num_features,num_features],stddev=1.0),dtype=tf.float32,name='wrelu3')
bRelu3 = tf.Variable(tf.zeros([num_features]),dtype=tf.float32,name='brelu3')
layer3 = tf.add(tf.matmul(relu2,WRelu3),bRelu3,name='layer3')
relu3 = tf.nn.relu(layer3,name='relu3')
#DROPOUT
relu3 = tf.nn.dropout(relu3,keep_prob=keep_prob,name='relu3drop')
#Out layer
Wout = tf.Variable(tf.truncated_normal([num_features,num_classes],stddev=1.0),dtype=tf.float32,name='wout')
bout = tf.Variable(tf.zeros([num_classes]),dtype=tf.float32,name='bout')
logits = tf.add(tf.matmul(relu3,Wout),bout,name='logits')
#Predictions
logits_sigmoid = tf.nn.sigmoid(logits,name='logits_sigmoid')
#Cost & Optimizer
cost = tf.losses.sigmoid_cross_entropy(y,logits)
optimizer = tf.train.AdamOptimizer(LEARNING_RATE).minimize(cost)
Evaluation results on test data:
ROC AUC - micro average: 0.5
ROC AUC - macro average: 0.5
Precision - micro average: 0.34146163499985405
Precision - macro average: 0.34146163499985405
Precision - weighted average: 0.3712475781926326
Recall - micro average: 1.0
Recall - macro average: 1.0
Recall - weighted average: 1.0
As you can see from the recall values in the Dropout version, the NN output is always 1, i.e. the positive class is predicted for every class of every sample.
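That is easy to check in isolation: with constant all-positive predictions, recall is 1.0 by construction and ROC AUC drops to 0.5, since constant scores carry no ranking information. A tiny toy example (made-up data, assuming scikit-learn):
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

y_true = np.array([[1,0],[0,1],[1,1],[0,0]])   #4 toy samples, 2 labels
y_pred = np.ones_like(y_true)                  #always predict the positive class
scores = np.full(y_true.shape, 0.9)            #constant scores, no ranking at all

print(recall_score(y_true, y_pred, average='micro'))   #1.0: every true positive is "found"
print(roc_auc_score(y_true, scores, average='micro'))  #0.5: constant scores cannot discriminate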
It's true that it's not an easy problem, but after applying Dropout I expected results at least similar to those without Dropout, not worse results and certainly not this saturated output.
Why could this be happening? How can I avoid this behaviour? Do you see anything strange or wrong in the code?
Hyperparameters:
Keep probability (keep_prob): 0.5 @ training / 1.0 @ inference (see the sketch below)
Epochs: 500
Learning rate: 0.0001
Dataset information:
Number of instances: 22,000+
Number of classes: 6
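For completeness, the keep_prob values above are fed through the feed_dict roughly like this (a sketch, not the exact training loop; EPOCHS, get_batches, sess and x_test are placeholder names):
#Sketch of how keep_prob is fed; batching details are omitted/assumed
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(EPOCHS):
        for batch_x, batch_y in get_batches():              #assumed batching helper
            sess.run(optimizer, feed_dict={x: batch_x,
                                           y: batch_y,
                                           keep_prob: 0.5}) #drop half the units while training
    #No units are dropped at inference time
    test_probs = sess.run(logits_sigmoid, feed_dict={x: x_test,
                                                     keep_prob: 1.0})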
Thanks!
Finally I've managed to solve my own question with some more experimentation, so this is what I figured out.
I exported the TensorBoard graph together with the weight, bias and activation data in order to explore them in TensorBoard.
Then I realized that something wasn't right with the weights.
Looking at the weight histograms, the weights weren't changing at all; in other words, the layers "weren't learning" anything.
But then the explanation was right in front of my eyes: the distribution of the weights was too broad. The histogram range was roughly [-2, 2], which is far too wide.
Then I realized that I was initializing the weight matrices with
truncated_normal(mean=0.0, stddev=1.0)
which is a really high standard deviation for a proper initialization. The obvious fix was to initialize the weights more sensibly, so I chose Xavier/Glorot initialization, and the weight distributions became much narrower.
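For reference, one way to do this in the same TF 1.x style (a sketch, shown for the first layer only; only the initializer changes with respect to the original code):
#Glorot/Xavier initialization instead of truncated_normal(stddev=1.0)
init = tf.glorot_uniform_initializer()
WRelu1 = tf.Variable(init([num_features,num_features]),dtype=tf.float32,name='wrelu1')

#Roughly equivalent by hand (Glorot-normal variant): stddev = sqrt(2 / (fan_in + fan_out))
stddev = (2.0/(num_features+num_features))**0.5
WRelu2 = tf.Variable(tf.truncated_normal([num_features,num_features],stddev=stddev),
                     dtype=tf.float32,name='wrelu2')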
The predictions stopped being all positive and became mixed predictions again, and of course the performance on the test set improved thanks to Dropout.
In summary, the net without Dropout was able to learn something despite that overly broad initialization, but the net with Dropout wasn't, and it needed a better initialization in order not to get stuck.
Thanks to everyone who read the post and contributed with a comment.