Reputation: 161
I am implementing a deep neural network and initializing the weights with a pre-training algorithm based on restricted Boltzmann machines. However, when I increase the number of hidden layers, the performance decreases (e.g. from 43% to 41%).
I have around 26K samples which I use for pre-training, and my input feature dimension is 98. I have tried several architectures, with different numbers of hidden nodes per layer (10, 50, 100) and with 1 and 2 hidden layers.
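For concreteness, the pre-training follows the usual greedy layer-wise scheme; below is a stripped-down sketch of what I mean (plain numpy, CD-1, with placeholder layer sizes and hyperparameters rather than my actual ones):

import numpy as np

rng = np.random.RandomState(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.01, batch_size=100):
    # One binary RBM trained with one-step contrastive divergence (CD-1).
    n_visible = data.shape[1]
    W = 0.01 * rng.randn(n_visible, n_hidden)
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            v0 = data[start:start + batch_size]
            # positive phase
            h0_prob = sigmoid(v0 @ W + b_hid)
            h0 = (h0_prob > rng.rand(*h0_prob.shape)).astype(float)
            # negative phase: one Gibbs step back down and up again
            v1_prob = sigmoid(h0 @ W.T + b_vis)
            h1_prob = sigmoid(v1_prob @ W + b_hid)
            # CD-1 parameter updates
            W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / len(v0)
            b_vis += lr * (v0 - v1_prob).mean(axis=0)
            b_hid += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, b_hid

# Greedy layer-wise pre-training: each RBM is trained on the hidden
# activations of the previous one, and its weights initialise that layer.
data = rng.rand(26000, 98)        # stand-in for the real pre-training samples
layer_sizes = [100, 100]          # placeholder hidden-layer sizes
pretrained = []
activations = data
for n_hidden in layer_sizes:
    W, b_hid = train_rbm(activations, n_hidden)
    pretrained.append((W, b_hid))
    activations = sigmoid(activations @ W + b_hid)

The weights and hidden biases of each RBM are then used to initialise the corresponding layer of the network before supervised fine-tuning.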
I have searched the literature, and the only reason given for a decrease in performance when adding layers is bad initialisation. However, that shouldn't apply here, since I am doing pre-training.
What do you think is causing the performance decline: is it something related to the way I do pre-training, or an insufficient amount of data? If you could provide some scientific papers as references, that would be awesome.
What would you recommend I do to fix this problem?
[Edit]
This blog post gives a nice overview of some important architectures and how they deal with the above-mentioned problem: https://towardsdatascience.com/an-intuitive-guide-to-deep-network-architectures-65fdc477db41
Upvotes: 2
Views: 2503
Reputation: 1064
This is most likely a result of the vanishing gradient problem: the more hidden layers you add, the smaller the gradient signal that reaches the earlier layers, so their weights change less and less significantly.
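A rough numeric illustration (a plain numpy sketch, not the asker's actual network): with saturating activations such as the sigmoid, every extra layer multiplies the backpropagated gradient by a small factor, so hardly any signal reaches the early layers:

import numpy as np

rng = np.random.RandomState(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

depth, width = 10, 100
weights = [0.1 * rng.randn(width, width) for _ in range(depth)]

# forward pass through a stack of sigmoid layers, keeping the activations
acts = [rng.randn(width)]
for W in weights:
    acts.append(sigmoid(acts[-1] @ W))

# backward pass: chain rule through the sigmoid (derivative a*(1-a)) and the
# linear map; the mean absolute gradient shrinks layer by layer
grad = np.ones(width)
for W, a in zip(reversed(weights), reversed(acts[1:])):
    grad = (grad * a * (1.0 - a)) @ W.T
    print(np.abs(grad).mean())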
Upvotes: 0
Reputation: 990
I added more layers to the MNIST example from the TensorFlow tutorials, but got a very bad result. So it is not true that a neural network with more layers automatically gives better predictions or higher accuracy. The following is my test code for the MNIST example in TensorFlow:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import sys

from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf

# Import data
data_dir = '/tmp/tensorflow/mnist/input_data'
mnist = input_data.read_data_sets(data_dir, one_hot=True)

# Create the model
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 784*2]))
b = tf.Variable(tf.zeros([784*2]))
x2 = tf.matmul(x, W) + b
#reluX = tf.nn.relu(x2)
W2 = tf.Variable(tf.zeros([784*2, 10]))
b2 = tf.Variable(tf.zeros([10]))
#y = tf.matmul(reluX, W2) + b2
y = tf.matmul(x2, W2) + b2

# Define loss and optimizer
y_ = tf.placeholder(tf.float32, [None, 10])

# The raw formulation of cross-entropy,
#
#   tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.nn.softmax(y)),
#                                 reduction_indices=[1]))
#
# can be numerically unstable.
#
# So here we use tf.nn.softmax_cross_entropy_with_logits on the raw
# outputs of 'y', and then average across the batch.
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
#train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
train_step = tf.train.AdamOptimizer(0.0005).minimize(cross_entropy)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

# Train
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(1000)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

# Test trained model
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x: mnist.test.images,
                                    y_: mnist.test.labels}))

correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x: mnist.train.images,
                                    y_: mnist.train.labels}))
Upvotes: 0
Reputation: 10995
This is most likely linked to the pre-training, since that is the mechanism allowing you to train multiple layers in the first place. I'm also not sure what exactly your training algorithm is. You say your pre-training is based on RBMs, but just to be sure: is your net a Deep Belief Network (DBN)?
If so, there are a great number of things you could have done wrong, but I'd highly recommend observing the gradients of the layers over time. If they decay or explode, one of your deep learning methods isn't working. I'd also try working on much simpler data, to confirm that you can successfully learn simple functions like XOR, sin and the like with multiple layers, which rules out the data as the source of error.
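As a rough, self-contained illustration of how to observe the gradients, in the TensorFlow 1.x style used in the other answer (layer sizes and data here are placeholders, not the asker's network):

import numpy as np
import tensorflow as tf

# Toy 3-layer sigmoid network on random data, only to show how per-layer
# gradient norms can be monitored over training steps.
x = tf.placeholder(tf.float32, [None, 98])
y_ = tf.placeholder(tf.float32, [None, 2])

h1 = tf.layers.dense(x, 50, activation=tf.nn.sigmoid, name="h1")
h2 = tf.layers.dense(h1, 50, activation=tf.nn.sigmoid, name="h2")
logits = tf.layers.dense(h2, 2, name="out")

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))
train_step = tf.train.AdamOptimizer(1e-3).minimize(loss)

params = tf.trainable_variables()
grad_norms = [tf.norm(g) for g in tf.gradients(loss, params)]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    data = np.random.rand(256, 98).astype(np.float32)
    labels = np.eye(2, dtype=np.float32)[np.random.randint(0, 2, 256)]
    for step in range(100):
        _, norms = sess.run([train_step, grad_norms],
                            feed_dict={x: data, y_: labels})
        if step % 20 == 0:
            # norms collapsing towards 0 suggest vanishing gradients,
            # norms blowing up suggest exploding gradients
            print(step, [round(float(n), 5) for n in norms])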
Finally, it's worth noting that "more layers = better performance" is not an actual rule of thumb (for DBMs specifically, see here); in fact, a multilayer perceptron with a single larger layer might perform better (partly related to the universal approximation theorem).
Upvotes: 0