OldDew

Reputation: 67

50% accuracy with a CNN on binary image classification

I have a collection of images of open and closed eyes.
The data is loaded from the current directory with Keras like this:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

batch_size = 64
N_images = 84898  # total number of images
h, w = 90, 90     # target image size (matches the model's input shape)

datagen = ImageDataGenerator(rescale=1./255)
data_iterator = datagen.flow_from_directory(
    './Eyes',
    shuffle=False,  # note: the string 'False' is truthy, so this must be the boolean
    color_mode='grayscale',
    target_size=(h, w),
    batch_size=batch_size,
    class_mode='binary')

I've got a .csv file with the state of each eye.

I've built this Sequential model:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

num_filters = 8
filter_size = 3
pool_size = 2

model = Sequential([
  Conv2D(num_filters, filter_size, input_shape=(90, 90, 1)),
  MaxPooling2D(pool_size=pool_size),
  Flatten(),
  Dense(16, activation='relu'),
  Dense(2, activation='sigmoid'),  # two classes: one for "open" and one for "closed"
])

Then I compile the model:

model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

Finally, I fit the model on all the data with the following:

from tensorflow.keras.utils import to_categorical

model.fit(
  train_images,
  to_categorical(train_labels),
  epochs=3,
  validation_data=(test_images, to_categorical(test_labels)),
)

The accuracy fluctuates around 50% and I do not understand why.

Upvotes: 0

Views: 1323

Answers (1)

DerekG

Reputation: 3958

Your current model essentially has one convolutional layer. That is, num_filters convolutional filters (in this case 3 x 3 arrays) are defined and fit so that, when convolved with the image, they produce features that are as discriminative as possible between classes. You then perform max-pooling to slightly reduce the dimension of the output CNN features before passing them to two dense layers.
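
For concreteness, here is the tensor shape after each layer of that model (a sketch assuming the 90 x 90 grayscale input, default 'valid' padding, and stride 1):

# Shape walk-through for the model above:
# Conv2D(8, 3)     -> (88, 88, 8)    since 90 - 3 + 1 = 88
# MaxPooling2D(2)  -> (44, 44, 8)
# Flatten()        -> (15488,)       since 44 * 44 * 8 = 15488
# Dense(16)        -> (16,)
# Dense(2)         -> (2,)
model.summary()  # prints the same shapes for verification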

I'd start by saying that one convolutional layer is almost certainly insufficient, especially with 3x3 filters. Basically, with a single convolutional layer, the most meaningful information you can get are edges or lines. These features are only marginally more useful to a function approximator (i.e. your fully connected layers) than the raw pixel intensity values because they still have an extremely high degree of variability both within a class and between classes. Consider that shifting an image of an eye 2 pixels to the left would result in completely different values output from your 1-layer CNN. You'd like the outputs of your CNN to be invariant to scale, rotation, illumination, etc.
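
To see that concretely, here is a minimal sketch (using NumPy/SciPy, with a hand-picked vertical-edge filter standing in for a learned one) showing that a single 3x3 convolution's output changes substantially when the input shifts by just 2 pixels:

import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
img = rng.random((90, 90))          # stand-in for a 90x90 grayscale eye image
shifted = np.roll(img, -2, axis=1)  # the same image shifted 2 pixels left

edge_filter = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]])  # a 3x3 vertical-edge detector

feat = convolve2d(img, edge_filter, mode='valid')
feat_shifted = convolve2d(shifted, edge_filter, mode='valid')

# The per-pixel features differ substantially -> not shift-invariant
print(np.abs(feat - feat_shifted).mean())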

In practice, this means you're going to need more convolutional layers. The relatively simple VGG16 network has 13 convolutional layers, and modern residual networks (e.g. ResNet) often have over 100. Try writing a routine that defines sequentially more complex networks until you start seeing performance gains, along the lines of the sketch below.
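
A minimal sketch of what that routine could look like in Keras (the block structure, filter counts, and depths here are illustrative assumptions, not tuned values):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

def build_model(num_blocks):
    """Stack num_blocks Conv->Conv->Pool blocks, doubling the filters each time."""
    model = Sequential()
    model.add(Conv2D(8, 3, activation='relu', input_shape=(90, 90, 1)))
    filters = 16
    for _ in range(num_blocks):
        model.add(Conv2D(filters, 3, activation='relu', padding='same'))
        model.add(Conv2D(filters, 3, activation='relu', padding='same'))
        model.add(MaxPooling2D(2))
        filters *= 2
    model.add(Flatten())
    model.add(Dense(16, activation='relu'))
    model.add(Dense(2))  # raw logits; see the next point about the activation
    return model

# Compare increasingly deep variants until validation accuracy improves
for depth in (1, 2, 3):
    model = build_model(depth)
    model.summary()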

As a secondary point, you generally don't want a sigmoid() activation on your final layer's outputs during training. It flattens the gradients and makes it much slower to backpropagate your loss. You don't actually care that the output values fall between 0 and 1, only about their relative magnitudes. Common practice is to use a cross-entropy loss, which combines a log-softmax (whose gradient is more stable than a plain softmax's) with a negative log-likelihood loss, as you've already done. Since the log-softmax portion transforms the output values into the desired range, there's no need for the sigmoid activation.
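
In Keras terms, one way to apply this (a sketch, not the only option) is to output raw logits from the final Dense layer and let the loss apply the log-softmax internally:

from tensorflow.keras.losses import CategoricalCrossentropy

# Final layer becomes Dense(2) with no activation (raw logits),
# replacing Dense(2, activation='sigmoid') in the original model.
model.compile(
    optimizer='adam',
    loss=CategoricalCrossentropy(from_logits=True),  # log-softmax + NLL in one op
    metrics=['accuracy'],
)

Equivalently, since this is a two-class problem, a single Dense(1) output with BinaryCrossentropy(from_logits=True) would also work.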

Upvotes: 1
