Reputation: 85
I have some trouble trying to set up a multilayer perceptron for binary classification using TensorFlow.
I have a very large dataset (about 1.5*10^6 examples), each with a binary (0/1) label and 100 features. What I need to do is set up a simple MLP and then try changing the learning rate and the initialization pattern and document the results (it's an assignment). I am getting strange results, though: my MLP seems to get stuck at a low-but-not-great cost early on and never gets off it. With fairly low values of the learning rate the cost goes to NaN almost immediately. I don't know if the problem lies in how I structured the MLP (I did a few tries; I'm going to post the code for the last one) or if I am missing something in my TensorFlow implementation.
import tensorflow as tf
import numpy as np
import scipy.io
# Import and transform dataset
print("Importing dataset.")
dataset = scipy.io.mmread('tfidf_tsvd.mtx')
with open('labels.txt') as f:
    all_labels = f.readlines()
all_labels = np.asarray(all_labels)
all_labels = all_labels.reshape((1498271, 1))
# Split dataset into training (66%) and test (33%) set
training_set = dataset[0:1000000]
training_labels = all_labels[0:1000000]
test_set = dataset[1000000:1498272]
test_labels = all_labels[1000000:1498272]
print("Dataset ready.")
# Parameters
learning_rate = 0.01 #argv
mini_batch_size = 100
training_epochs = 10000
display_step = 500
# Network Parameters
n_hidden_1 = 64 # 1st hidden layer of neurons
n_hidden_2 = 32 # 2nd hidden layer of neurons
n_hidden_3 = 16 # 3rd hidden layer of neurons
n_input = 100 # number of features after LSA
# Tensorflow Graph input
x = tf.placeholder(tf.float64, shape=[None, n_input], name="x-data")
y = tf.placeholder(tf.float64, shape=[None, 1], name="y-labels")
print("Creating model.")
# Create model
def multilayer_perceptron(x, weights):
    # First hidden layer with SIGMOID activation
    layer_1 = tf.matmul(x, weights['h1'])
    layer_1 = tf.nn.sigmoid(layer_1)
    # Second hidden layer with SIGMOID activation
    layer_2 = tf.matmul(layer_1, weights['h2'])
    layer_2 = tf.nn.sigmoid(layer_2)
    # Third hidden layer with SIGMOID activation
    layer_3 = tf.matmul(layer_2, weights['h3'])
    layer_3 = tf.nn.sigmoid(layer_3)
    # Output layer with SIGMOID activation
    out_layer = tf.matmul(layer_2, weights['out'])
    return out_layer
# Layer weights, should change them to see results
weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1], dtype=np.float64)),
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2], dtype=np.float64)),
    'h3': tf.Variable(tf.random_normal([n_hidden_2, n_hidden_3], dtype=np.float64)),
    'out': tf.Variable(tf.random_normal([n_hidden_2, 1], dtype=np.float64))
}
# Construct model
pred = multilayer_perceptron(x, weights)
# Define loss and optimizer
cost = tf.nn.l2_loss(pred-y,name="squared_error_cost")
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
# Initializing the variables
init = tf.initialize_all_variables()
print("Model ready.")
# Launch the graph
with tf.Session() as sess:
    sess.run(init)
    print("Starting Training.")
    # Training cycle
    for epoch in range(training_epochs):
        #avg_cost = 0.
        # minibatch loading
        minibatch_x = training_set[mini_batch_size*epoch:mini_batch_size*(epoch+1)]
        minibatch_y = training_labels[mini_batch_size*epoch:mini_batch_size*(epoch+1)]
        # Run optimization op (backprop) and cost op
        _, c = sess.run([optimizer, cost], feed_dict={x: minibatch_x, y: minibatch_y})
        # Compute average loss
        avg_cost = c / (minibatch_x.shape[0])
        # Display logs per epoch
        if (epoch) % display_step == 0:
            print("Epoch:", '%05d' % (epoch), "Training error=", "{:.9f}".format(avg_cost))
    print("Optimization Finished!")
    # Test model
    # Calculate accuracy
    test_error = tf.nn.l2_loss(pred-y, name="squared_error_test_cost")/test_set.shape[0]
    print("Test Error:", test_error.eval({x: test_set, y: test_labels}))
python nn.py
Importing dataset.
Dataset ready.
Creating model.
Model ready.
Starting Training.
Epoch: 00000 Training error= 0.331874878
Epoch: 00500 Training error= 0.121587482
Epoch: 01000 Training error= 0.112870921
Epoch: 01500 Training error= 0.110293652
Epoch: 02000 Training error= 0.122655269
Epoch: 02500 Training error= 0.124971940
Epoch: 03000 Training error= 0.125407845
Epoch: 03500 Training error= 0.131942481
Epoch: 04000 Training error= 0.121696954
Epoch: 04500 Training error= 0.116669835
Epoch: 05000 Training error= 0.129558477
Epoch: 05500 Training error= 0.122952110
Epoch: 06000 Training error= 0.124655344
Epoch: 06500 Training error= 0.119827300
Epoch: 07000 Training error= 0.125183779
Epoch: 07500 Training error= 0.156429254
Epoch: 08000 Training error= 0.085632880
Epoch: 08500 Training error= 0.133913128
Epoch: 09000 Training error= 0.114762624
Epoch: 09500 Training error= 0.115107805
Optimization Finished!
Test Error: 0.116647016708
This is what MMN advised:
weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1], stddev=0, dtype=np.float64)),
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2], stddev=0.01, dtype=np.float64)),
    'h3': tf.Variable(tf.random_normal([n_hidden_2, n_hidden_3], stddev=0.01, dtype=np.float64)),
    'out': tf.Variable(tf.random_normal([n_hidden_2, 1], dtype=np.float64))
}
This is the output
Epoch: 00000 Training error= 0.107566668
Epoch: 00500 Training error= 0.289380907
Epoch: 01000 Training error= 0.339091784
Epoch: 01500 Training error= 0.358559815
Epoch: 02000 Training error= 0.122639698
Epoch: 02500 Training error= 0.125160135
Epoch: 03000 Training error= 0.126219718
Epoch: 03500 Training error= 0.132500418
Epoch: 04000 Training error= 0.121795254
Epoch: 04500 Training error= 0.116499476
Epoch: 05000 Training error= 0.124532673
Epoch: 05500 Training error= 0.124484790
Epoch: 06000 Training error= 0.118491177
Epoch: 06500 Training error= 0.119977633
Epoch: 07000 Training error= 0.127532511
Epoch: 07500 Training error= 0.159053519
Epoch: 08000 Training error= 0.083876224
Epoch: 08500 Training error= 0.131488483
Epoch: 09000 Training error= 0.123161189
Epoch: 09500 Training error= 0.125011362
Optimization Finished!
Test Error: 0.129284643093
Connected third hidden layer, thanks to MMN
There was a mistake in my code: I had two hidden layers instead of three. I corrected it by doing:
'out': tf.Variable(tf.random_normal([n_hidden_3, 1], dtype=np.float64))
and
out_layer = tf.matmul(layer_3, weights['out'])
I returned to the old value for stddev, though, as it seems to cause less fluctuation in the cost function.
The output is still troubling:
Epoch: 00000 Training error= 0.477673073
Epoch: 00500 Training error= 0.121848744
Epoch: 01000 Training error= 0.112854530
Epoch: 01500 Training error= 0.110597624
Epoch: 02000 Training error= 0.122603499
Epoch: 02500 Training error= 0.125051472
Epoch: 03000 Training error= 0.125400717
Epoch: 03500 Training error= 0.131999354
Epoch: 04000 Training error= 0.121850889
Epoch: 04500 Training error= 0.116551533
Epoch: 05000 Training error= 0.129749704
Epoch: 05500 Training error= 0.124600464
Epoch: 06000 Training error= 0.121600218
Epoch: 06500 Training error= 0.121249676
Epoch: 07000 Training error= 0.132656938
Epoch: 07500 Training error= 0.161801757
Epoch: 08000 Training error= 0.084197352
Epoch: 08500 Training error= 0.132197409
Epoch: 09000 Training error= 0.123249055
Epoch: 09500 Training error= 0.126602369
Optimization Finished!
Test Error: 0.129230736355
Two more changes, thanks to Steven. Steven proposed replacing the Sigmoid activation function with ReLU, so I tried that. In the meantime, I noticed I hadn't set an activation function for the output node, so I did that too (it should be easy to see what I changed).
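In other words, the model should now look roughly like this (ReLU on every node; the run further below swaps tf.nn.relu for tf.nn.sigmoid):
layer_1 = tf.nn.relu(tf.matmul(x, weights['h1']))
layer_2 = tf.nn.relu(tf.matmul(layer_1, weights['h2']))
layer_3 = tf.nn.relu(tf.matmul(layer_2, weights['h3']))
out_layer = tf.nn.relu(tf.matmul(layer_3, weights['out']))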
Starting Training.
Epoch: 00000 Training error= 293.245977809
Epoch: 00500 Training error= 0.290000000
Epoch: 01000 Training error= 0.340000000
Epoch: 01500 Training error= 0.360000000
Epoch: 02000 Training error= 0.285000000
Epoch: 02500 Training error= 0.250000000
Epoch: 03000 Training error= 0.245000000
Epoch: 03500 Training error= 0.260000000
Epoch: 04000 Training error= 0.290000000
Epoch: 04500 Training error= 0.315000000
Epoch: 05000 Training error= 0.285000000
Epoch: 05500 Training error= 0.265000000
Epoch: 06000 Training error= 0.340000000
Epoch: 06500 Training error= 0.180000000
Epoch: 07000 Training error= 0.370000000
Epoch: 07500 Training error= 0.175000000
Epoch: 08000 Training error= 0.105000000
Epoch: 08500 Training error= 0.295000000
Epoch: 09000 Training error= 0.280000000
Epoch: 09500 Training error= 0.285000000
Optimization Finished!
Test Error: 0.220196439287
This is what it does with the Sigmoid activation function on every node, output included
Epoch: 00000 Training error= 0.110878121
Epoch: 00500 Training error= 0.119393080
Epoch: 01000 Training error= 0.109229532
Epoch: 01500 Training error= 0.100436962
Epoch: 02000 Training error= 0.113160662
Epoch: 02500 Training error= 0.114200962
Epoch: 03000 Training error= 0.109777990
Epoch: 03500 Training error= 0.108218725
Epoch: 04000 Training error= 0.103001394
Epoch: 04500 Training error= 0.084145737
Epoch: 05000 Training error= 0.119173495
Epoch: 05500 Training error= 0.095796251
Epoch: 06000 Training error= 0.093336573
Epoch: 06500 Training error= 0.085062860
Epoch: 07000 Training error= 0.104251661
Epoch: 07500 Training error= 0.105910949
Epoch: 08000 Training error= 0.090347288
Epoch: 08500 Training error= 0.124480612
Epoch: 09000 Training error= 0.109250224
Epoch: 09500 Training error= 0.100245836
Optimization Finished!
Test Error: 0.110234139674
I find these numbers very strange. In the first case, it gets stuck at a higher cost than with sigmoid, even though sigmoid should saturate very early. In the second case, it starts with a training error that is almost equal to the final one... so it basically converges within one mini-batch. I'm starting to think that I am not calculating the cost correctly, in this line: avg_cost = c / (minibatch_x.shape[0])
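For reference, tf.nn.l2_loss(t) computes sum(t ** 2) / 2, so the value I print is
avg_cost = sum((pred - y) ** 2) / (2 * mini_batch_size)
i.e. half of the mean squared error over the mini-batch.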
Upvotes: 4
Views: 2334
Reputation: 827
Along with the above answers, I suggest that you try the cost function tf.nn.sigmoid_cross_entropy_with_logits(logits, targets, name=None).
Since this is binary classification, you should try the sigmoid_cross_entropy_with_logits cost function.
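A rough sketch of how it could slot into your existing graph (keeping your pred and y tensors; note that depending on your TensorFlow version the second argument is named targets or labels, and pred must stay a raw logit, i.e. no sigmoid on the output layer):
cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(pred, y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)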
I also suggest plotting the training and test accuracy against the number of epochs, i.e. checking whether the model is overfitting.
If it is not overfitting, try making your neural net more complex, by increasing the number of neurons and the number of layers. You will reach a point beyond which your training accuracy keeps increasing but your validation accuracy does not; that point gives the best model.
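For the accuracy curves, something along these lines should work (my sketch, assuming pred is a raw logit and y holds 0/1 labels as float64):
predicted_class = tf.cast(tf.greater(tf.sigmoid(pred), 0.5), tf.float64)
accuracy = tf.reduce_mean(tf.cast(tf.equal(predicted_class, y), tf.float64))
Evaluate accuracy on both the training and the test set every few hundred epochs and plot the two curves against the epoch number.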
Upvotes: 1
Reputation: 5162
So it could be a couple of things:
First, replace:
tf.nn.sigmoid(layer_n)
with:
tf.nn.relu(layer_n)
Second, replace:
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
with:
optimizer = tf.train.AdamOptimizer().minimize(cost)
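(AdamOptimizer's default learning rate is 0.001; since your assignment involves sweeping the learning rate, you can still pass one explicitly, e.g. optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost).)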
A few other points: your layers currently have no bias terms, and adding them usually helps, like so:
biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1], dtype=np.float64)),
    'b2': tf.Variable(tf.random_normal([n_hidden_2], dtype=np.float64)),
    'b3': tf.Variable(tf.random_normal([n_hidden_3], dtype=np.float64)),
    'bout': tf.Variable(tf.random_normal([1], dtype=np.float64))
}
def multilayer_perceptron(x, weights):
    # First hidden layer with SIGMOID activation
    layer_1 = tf.matmul(x, weights['h1']) + biases['b1']
    layer_1 = tf.nn.sigmoid(layer_1)
    # Second hidden layer with SIGMOID activation
    layer_2 = tf.matmul(layer_1, weights['h2']) + biases['b2']
    layer_2 = tf.nn.sigmoid(layer_2)
    # Third hidden layer with SIGMOID activation
    layer_3 = tf.matmul(layer_2, weights['h3']) + biases['b3']
    layer_3 = tf.nn.sigmoid(layer_3)
    # Output layer with SIGMOID activation
    out_layer = tf.matmul(layer_2, weights['out']) + biases['bout']
    return out_layer
You might also want to decay the learning rate over time instead of keeping it fixed, like so:
learning_rate = tf.train.exponential_decay(INITIAL_LEARNING_RATE,
                                           global_step,
                                           decay_steps,
                                           LEARNING_RATE_DECAY_FACTOR,
                                           staircase=True)
You just need to define the decay steps, i.e. when to decay, and LEARNING_RATE_DECAY_FACTOR, i.e. by how much to decay.
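A minimal sketch of how the pieces fit together (the numbers here are placeholders, not recommendations):
INITIAL_LEARNING_RATE = 0.01
LEARNING_RATE_DECAY_FACTOR = 0.96
decay_steps = 100000  # decay every 100k mini-batch updates
global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(INITIAL_LEARNING_RATE,
                                           global_step,
                                           decay_steps,
                                           LEARNING_RATE_DECAY_FACTOR,
                                           staircase=True)
# pass global_step so the optimizer increments it on every update
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost, global_step=global_step)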
Upvotes: 2
Reputation: 676
Your weights are initialized with a stddev of 1, so the output of layer 1 will have a stddev of 10 or so. This might be saturating the sigmoid functions to the point where most gradients are 0.
Can you try initializing the hidden weights with a stddev of 0.01?
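Something along these lines (my sketch, reusing your layer sizes):
'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1], stddev=0.01, dtype=np.float64)),
'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2], stddev=0.01, dtype=np.float64)),
'h3': tf.Variable(tf.random_normal([n_hidden_2, n_hidden_3], stddev=0.01, dtype=np.float64)),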
Upvotes: 1