rzrshr

Reputation: 95

Binary classifier always returns 0.5

I am training a classifier that takes an RGB input (three values from 0 to 255) and returns whether a black or white font (0 or 1) would fit best with that colour. After training, my classifier always returns 0.5 (or thereabouts) and never gets any more accurate than that.

The code is below:

import tensorflow as tf
import numpy as np
from tqdm import tqdm

print('Creating Datasets:')

x_train = []
y_train = []

for i in tqdm(range(10000)):
    x_train.append([np.random.uniform(0, 255), np.random.uniform(0, 255), np.random.uniform(0, 255)])

for elem in tqdm(x_train):
    if (((elem[0] + elem[1] + elem[2]) / 3) / 255) > 0.5:
        y_train.append(0)
    else:
        y_train.append(1)

x_train = np.array(x_train)
y_train = np.array(y_train)

graph = tf.Graph()

with graph.as_default():

    x = tf.placeholder(tf.float32)
    y = tf.placeholder(tf.float32)

    w_1 = tf.Variable(tf.random_normal([3, 10], stddev=1.0), tf.float32)
    b_1 = tf.Variable(tf.random_normal([10]), tf.float32)
    l_1 = tf.sigmoid(tf.matmul(x, w_1) + b_1)

    w_2 = tf.Variable(tf.random_normal([10, 10], stddev=1.0), tf.float32)
    b_2 = tf.Variable(tf.random_normal([10]), tf.float32)
    l_2 = tf.sigmoid(tf.matmul(l_1, w_2) + b_2)

    w_3 = tf.Variable(tf.random_normal([10, 5], stddev=1.0), tf.float32)
    b_3 = tf.Variable(tf.random_normal([5]), tf.float32)
    l_3 = tf.sigmoid(tf.matmul(l_2, w_3) + b_3)

    w_4 = tf.Variable(tf.random_normal([5, 1], stddev=1.0), tf.float32)
    b_4 = tf.Variable(tf.random_normal([1]), tf.float32)
    y_ = tf.sigmoid(tf.matmul(l_3, w_4) + b_4)

    loss = tf.reduce_mean(tf.squared_difference(y, y_))

    optimizer = tf.train.AdadeltaOptimizer().minimize(loss)

    with tf.Session() as sess:

        sess.run(tf.global_variables_initializer())

        print('Training:')

        for step in tqdm(range(5000)):
            index = np.random.randint(0, len(x_train) - 129)
            feed_dict = {x : x_train[index:index+128], y : y_train[index:index+128]}
            sess.run(optimizer, feed_dict=feed_dict)
            if step % 1000 == 0:
                print(sess.run([loss], feed_dict=feed_dict))

        while True:
            inp1 = int(input(''))
            inp2 = int(input(''))
            inp3 = int(input(''))
            print(sess.run(y_, feed_dict={x : [[inp1, inp2, inp3]]}))

As you can see, I start by importing the modules I will be using. Next I generate my input x dataset and desired output y dataset. The x_train dataset consists of 10000 random RGB values, while the y_train dataset consists of 0's and 1's, with a 1 corresponding to an RGB value with a mean lower than 128 and a 0 corresponding to an RGB value with a mean higher than 128 (this ensures bright backgrounds get dark font and vice versa).

My neural net is admittedly overly complex (or so I assume), but as far as I am aware it is a pretty standard feed-forward net, with an Adadelta optimiser and the default learning rate.

The training of the net looks normal as far as my limited knowledge can tell, but the model nonetheless always spits out 0.5.

The last block of code allows the user to input values and see what they turn into when passed to the neural net.

I have messed around with different activation functions, losses, methods of initialising the biases, etc., but to no avail. Sometimes when I tinker with the code the model always returns 1 or 0 instead, but that is just as inaccurate as being indecisive and returning 0.5 over and over. I have not been able to find a suitable solution to my problem online. Any advice or suggestions are welcome.

Edit:

The loss, weights, biases and output don't change much over the course of training (the weights and biases only change by hundredths or thousandths every 1000 iterations, and the loss fluctuates around 0.3). Also, the output sometimes varies depending on the input (as you would expect), but at other times it is constant. One run of the program led to constant 0.7s as output, while another always returned 0.5 except very near zero, where it returned values around 0.3 or 0.4. Neither of these is the desired behaviour. What should happen is that (255, 255, 255) maps to 0, (0, 0, 0) maps to 1, and (128, 128, 128) maps to either 1 or 0, since in the middle the font colour doesn't really matter.
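
For reference, this is roughly the kind of logging that shows the problem, slotted into the training loop above (loss, w_1 and y_ are the variables already defined there):

if step % 1000 == 0:
    # check the loss, the first-layer weights and a few raw outputs
    batch_loss, first_weights, sample_out = sess.run([loss, w_1, y_], feed_dict=feed_dict)
    print('loss:', batch_loss)
    print('w_1 first row:', first_weights[0])
    print('sample outputs:', sample_out[:5].ravel())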

Upvotes: 4

Views: 4425

Answers (2)

Max Weinzierl

Reputation: 1261

The largest issue was that you were using mean squared error as your loss function on a classification problem. The cross-entropy loss function is much more suited for this kind of problem.

Here's a visualization of the difference between the cross-entropy loss function and the mean squared error loss function:

[Figure: MSE vs cross-entropy loss for a true label of 1. Source: Wolfram Alpha]

Notice how the cross-entropy loss keeps growing as the model gets further from the correct prediction (in this case 1). This curvature provides a much stronger gradient signal during backpropagation, while also satisfying many important theoretical properties of probability-distribution distances (divergences). By minimizing the cross-entropy loss you are actually also minimizing the KL divergence between your model's prediction distribution and the training-data label distribution. You can read more about the cross-entropy loss function here: http://colah.github.io/posts/2015-09-Visual-Information/
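
To see this concretely, here is a quick NumPy sketch (purely illustrative, separate from the model code below) comparing the two losses on a single example whose true label is 1, as the predicted probability moves away from 1:

import numpy as np

p = np.array([0.9, 0.5, 0.1, 0.01])  # predicted probability of the correct class
mse = (1 - p) ** 2                   # squared error against the label 1
xent = -np.log(p)                    # cross-entropy against the label 1

for p_i, m, c in zip(p, mse, xent):
    print('p={:.2f}  MSE={:.3f}  cross-entropy={:.3f}'.format(p_i, m, c))

# MSE is bounded by 1 no matter how wrong the prediction gets, while
# cross-entropy grows without bound as p -> 0, so the gradient signal
# stays strong for confidently wrong predictions.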

I also tweaked a few other things to make the code better and make the model easier to modify. This should solve all your problems:

import tensorflow as tf
import numpy as np
from tqdm import tqdm

# define a random seed for (somewhat) reproducible results:
seed = 0
np.random.seed(seed)
print('Creating Datasets:')

# much faster dataset creation
x_train = np.random.uniform(low=0, high=255, size=[10000, 3])
# easier label creation
# if the average color is greater than half the color space then use black, otherwise use white
# classes:
# white = 0
# black = 1
y_train = ((np.mean(x_train, axis=1) / 255.0) > 0.5).astype(int)

# now transform dataset to be within range [-1, 1] instead of [0, 255] 
# for numeric stability and quicker model training
x_train = (2 * (x_train / 255)) - 1

graph = tf.Graph()

with graph.as_default():
    # must do this within graph scope
    tf.set_random_seed(seed)
    # specify input dims for clarity
    x = tf.placeholder(tf.float32, shape=[None, 3])
    # y is now integer label [0 or 1]
    y = tf.placeholder(tf.int32, shape=[None])
    # use relu, usually better than sigmoid 
    activation_fn = tf.nn.relu
    # from https://arxiv.org/abs/1502.01852v1
    initializer = tf.initializers.variance_scaling(
        scale=2.0, 
        mode='fan_in',
        distribution='truncated_normal')
    # better api to reduce clutter
    l_1 = tf.layers.dense(
        x,
        10,
        activation=activation_fn,
        kernel_initializer=initializer)
    l_2 = tf.layers.dense(
        l_1,
        10,
        activation=activation_fn,
        kernel_initializer=initializer)
    l_3 = tf.layers.dense(
        l_2,
        5,
        activation=activation_fn,
        kernel_initializer=initializer)
    y_logits = tf.layers.dense(
        l_3,
        2,
        activation=None,
        kernel_initializer=initializer)

    y_ = tf.nn.softmax(y_logits)
    # much better loss function for classification
    loss = tf.reduce_mean(
        tf.losses.sparse_softmax_cross_entropy(
            labels=y, 
            logits=y_logits))
    # much better default optimizer for new problems
    # good learning rate, but probably can tune
    optimizer = tf.train.AdamOptimizer(
        learning_rate=0.01)
    # separate train op for easier calling
    train_op = optimizer.minimize(loss)

    # tell tensorflow not to allocate all gpu memory at start
    config = tf.ConfigProto()
    config.gpu_options.allow_growth=True
    with tf.Session(config=config) as sess:

        sess.run(tf.global_variables_initializer())

        print('Training:')

        for step in tqdm(range(5000)):
            index = np.random.randint(0, len(x_train) - 129)
            feed_dict = {x : x_train[index:index+128], 
                         y : y_train[index:index+128]}
            # can train and get loss in single run, much more efficient
            _, b_loss = sess.run([train_op, loss], feed_dict=feed_dict)
            if step % 1000 == 0:
                print(b_loss)

        while True:
            inp1 = int(input('Enter R pixel color: '))
            inp2 = int(input('Enter G pixel color: '))
            inp3 = int(input('Enter B pixel color: '))
            # scale to model train range [-1, 1]
            model_input = (2 * (np.array([inp1, inp2, inp3], dtype=float) / 255.0)) - 1
            if (model_input >= -1).all() and (model_input <= 1).all():
                # y_ is now two probabilities (white_prob, black_prob) but they will sum to 1.
                white_prob, black_prob = sess.run(y_, feed_dict={x : [model_input]})[0]
                print('White prob: {:.2f} Black prob: {:.2f}'.format(white_prob, black_prob))
            else:
                print('Values not within [0, 255]!')

I documented my changes with comments, but let me know if you have any questions! I ran this on my end and it worked perfectly:

Creating Datasets:
2018-10-05 00:50:59.156822: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-10-05 00:50:59.411003: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:03:00.0
totalMemory: 8.00GiB freeMemory: 6.60GiB
2018-10-05 00:50:59.417736: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1484] Adding visible gpu devices: 0
2018-10-05 00:51:00.109351: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-05 00:51:00.113660: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971]      0
2018-10-05 00:51:00.118545: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:984] 0:   N
2018-10-05 00:51:00.121605: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6370 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
Training:
  0%|                                                                                         | 0/5000 [00:00<?, ?it/s]0.6222609
 19%|██████████████▋                                                               | 940/5000 [00:01<00:14, 275.57it/s]0.013466636
 39%|██████████████████████████████                                               | 1951/5000 [00:02<00:04, 708.07it/s]0.0067519126
 59%|█████████████████████████████████████████████▊                               | 2971/5000 [00:04<00:02, 733.24it/s]0.0028143923
 79%|████████████████████████████████████████████████████████████▌                | 3935/5000 [00:05<00:01, 726.36it/s]0.0073514087
100%|█████████████████████████████████████████████████████████████████████████████| 5000/5000 [00:07<00:00, 698.32it/s]
Enter R pixel color: 1
Enter G pixel color: 1
Enter B pixel color: 1
White prob: 1.00 Black prob: 0.00
Enter R pixel color: 255
Enter G pixel color: 255
Enter B pixel color: 255
White prob: 0.00 Black prob: 1.00
Enter R pixel color: 128
Enter G pixel color: 128
Enter B pixel color: 128
White prob: 0.08 Black prob: 0.92
Enter R pixel color: 126
Enter G pixel color: 126
Enter B pixel color: 126
White prob: 0.99 Black prob: 0.01

Upvotes: 1

xdurch0

Reputation: 10474

Two things I see from looking at your network:

  1. Sigmoid activation in the hidden layers is usually a bad choice. The sigmoid function saturates for large (positive or negative) inputs, so the gradient becomes smaller and smaller as it is backpropagated through the network. This is commonly referred to as the "vanishing gradient" problem. It could be that the gradient for the variables near the output is "healthy" and so the upper layers are learning, but if the lower layers receive no gradient they will simply keep returning random values that the higher layers can't work with. You could try replacing the sigmoid activations with e.g. tf.nn.relu. Sigmoid in the output layer is okay (and more or less necessary if you want your outputs to be 0/1), but consider using cross entropy as the loss function instead of squared error.
  2. Your weight initialization likely results in excessively large weights. A standard deviation of 1.0 is way too high. This can lead to numerical issues as well as saturating the activations even more (since, with such large weights, you can expect large activation values from the start). Try something like a standard deviation of 0.1, and consider using truncated_normal to prevent outliers (or a uniform random initialization). A small sketch illustrating both points follows below.

It's difficult to say whether this will fix your issues, but I believe both of these are things you should definitely change about your network as it stands right now.
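
To illustrate both points with concrete numbers, here is a rough NumPy-only sketch (illustrative values, not a fix by itself): it prints how flat the sigmoid's local gradient is away from zero, and how large the first layer's pre-activations get with raw [0, 255] inputs and weights drawn with stddev 1.0:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# local gradient of the sigmoid: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
for z in (0.0, 2.0, 5.0, 10.0, 50.0):
    s = sigmoid(z)
    print('z={:>4.0f}: sigmoid={:.6f}, local gradient={:.2e}'.format(z, s, s * (1.0 - s)))

# with the question's setup, the first layer's pre-activations are already huge:
rng = np.random.RandomState(0)
x = rng.uniform(0, 255, size=(128, 3))    # raw, unscaled RGB inputs
w = rng.normal(0.0, 1.0, size=(3, 10))    # weights drawn with stddev=1.0
print('typical |pre-activation|:', np.abs(x @ w).mean())

The local gradient peaks at 0.25 at z = 0 and is practically zero by |z| of about 10, while the pre-activations printed above typically land in the hundreds, so the first layer's sigmoids start out saturated and barely learn.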

Upvotes: 3
