Reputation: 1421
I'm trying to implement a neural network that classifies images into one of the two discrete categories. The problem is, however, that it currently always predicts 0 for any input and I'm not really sure why.
Here's my feature extraction method:
def extract(file):
# Resize and subtract mean pixel
img = cv2.resize(cv2.imread(file), (224, 224)).astype(np.float32)
img[:, :, 0] -= 103.939
img[:, :, 1] -= 116.779
img[:, :, 2] -= 123.68
# Normalize features
img = (img.flatten() - np.mean(img)) / np.std(img)
return np.array([img])
Here's my gradient descent routine:
def fit(x, y, t1, t2):
"""Training routine"""
ils = x.shape[1] if len(x.shape) > 1 else 1
labels = len(set(y))
if t1 is None or t2 is None:
t1 = randweights(ils, 10)
t2 = randweights(10, labels)
params = np.concatenate([t1.reshape(-1), t2.reshape(-1)])
res = grad(params, ils, 10, labels, x, y)
params -= 0.1 * res
return unpack(params, ils, 10, labels)
Here are my forward and back(gradient) propagations:
def forward(x, theta1, theta2):
"""Forward propagation"""
m = x.shape[0]
# Forward prop
a1 = np.vstack((np.ones([1, m]), x.T))
z2 = np.dot(theta1, a1)
a2 = np.vstack((np.ones([1, m]), sigmoid(z2)))
a3 = sigmoid(np.dot(theta2, a2))
return (a1, a2, a3, z2, m)
def grad(params, ils, hls, labels, x, Y, lmbda=0.01):
"""Compute gradient for hypothesis Theta"""
theta1, theta2 = unpack(params, ils, hls, labels)
a1, a2, a3, z2, m = forward(x, theta1, theta2)
d3 = a3 - Y.T
print('Current error: {}'.format(np.mean(np.abs(d3))))
d2 = np.dot(theta2.T, d3) * (np.vstack([np.ones([1, m]), sigmoid_prime(z2)]))
d3 = d3.T
d2 = d2[1:, :].T
t1_grad = np.dot(d2.T, a1.T)
t2_grad = np.dot(d3.T, a2.T)
theta1[0] = np.zeros([1, theta1.shape[1]])
theta2[0] = np.zeros([1, theta2.shape[1]])
t1_grad = t1_grad + (lmbda / m) * theta1
t2_grad = t2_grad + (lmbda / m) * theta2
return np.concatenate([t1_grad.reshape(-1), t2_grad.reshape(-1)])
And here's my prediction function:
def predict(theta1, theta2, x):
"""Predict output using learned weights"""
m = x.shape[0]
h1 = sigmoid(np.hstack((np.ones([m, 1]), x)).dot(theta1.T))
h2 = sigmoid(np.hstack((np.ones([m, 1]), h1)).dot(theta2.T))
return h2.argmax(axis=1)
I can see that the error rate is gradually decreasing with each iteration, generally converging somewhere around 1.26e-05.
What I've tried so far:
Edit: An average output of h2 looks like the following:
[0.5004899 0.45264441]
[0.50048522 0.47439413]
[0.50049019 0.46557124]
[0.50049261 0.45297816]
So, very similar sigmoid outputs for all validation examples.
Upvotes: 88
Views: 139514
Reputation: 41
Same happened to me. The model was predicting one class only for seven class CNN. I tried to change the activation function, batch size but nothing worked. Then changing the learning rate worked for me too.
opt = keras.optimizers.Adam(learning_rate=1e-06)
As you can see, I had to chose a very low learning rate. My number of training samples is 5250 and validation samples 1575.
Upvotes: 3
Reputation: 21
After trying many solutions, it turns out that the problem for me was with the the prediction phase not the training nor model architecture. The method which I was using for prediction was showing zeros for all cases even though I have relatively high validation accuracy because this line:
predicted_class_indices=np.argmax(scores,axis=1)
If you are dealing with binary classification, try:
predict = model.predict(
validation_generator, steps=None, callbacks=None, max_queue_size=10, workers=1,
use_multiprocessing=False, verbose=0
)
Upvotes: 2
Reputation: 111
I came across the problem that model always predict the same label.It confused me for a week.At last ,I solved it by replacing the RELU with other activation function.The RELU will cause the "Dying ReLU" problem.
Before I solved the problem.I tried :
Finally I find that descrese the learning rate from 0.005 to 0.0002 is already valid.
Upvotes: 3
Reputation: 1
the TOPUP answer really work for me. My circumstance is while I am training the model of bert4reco with a large dataset( 4million+ samples), the acc and log_loss always stay between 0.5 and 0.8 during the whole epoch (It cost 8 hour, I print the result every 100 steps). Then I use a very small scale dataset, and a smaller model, finally it works! the model start to learn something, acc and log_loss begin to increase and reach a convergence after 300 epoches !
Conclusively, the TOPUP answer is a good checklist for these kind of questions. And sometime if you can not see any changes in the begining of train, that maybe it will take a lot of time for your model to really learn something. It would better to user mini dataset to assert this, and after that you can wait for it to learn or use some effective equipment such as GPUs or TPUs
Upvotes: 0
Reputation: 1735
I also had the same problem, I do binary classification by using transfer learning with ResNet50, I was able to solve it by replacing:
Dense(output_dim=2048, activation= 'relu')
with
Dense(output_dim=128, activation= 'relu')
and also by removing Keras Augmentation and retrain last layers of RestNet50
Upvotes: 2
Reputation: 181
Same happened to me. Mine was in deeplearning4j
JAVA
library for image classification.It kept on giving the final output of the last training folder for every test. I was able to solve it by decreasing the learning rate.
Approaches can be used :
Upvotes: 9
Reputation: 891
Just incase some one else encounters this problem. Mine was with a deeplearning4j
Lenet(CNN) architecture, It kept on giving the final output of the last training folder for every test.
I was able to solve it by increasing my batchsize
and shuffling the training data
so each batch contained at least a sample from more than one folder. My data class had a batchsize of 1 which was really dangerous
.
Edit: Although another thing I observed recently is having limited sets of training samples per class despite having a large dataset
. e.g. training a neural-network
to recognise human faces
but having only a maximum of say 2 different faces for 1 person
mean while the dataset consists of say 10,000 persons
thus a dataset
of 20,000 faces
in total. A better dataset
would be 1000 different faces
for 10,000 persons
thus a dataset
of 10,000,000 faces
in total. This is relatively necessary if you want avoid overfitting the data to one class so your network
can easily generalise and produce better predictions.
Upvotes: 2
Reputation: 136359
My network does always predict the same class. What is the problem?
I had this a couple of times. Although I'm currently too lazy to go through your code, I think I can give some general hints which might also help others who have the same symptom but probably different underlying problems.
For every class i the network should be able to predict, try the following:
If this doesn't work, there are four possible error sources:
float32
but actually is an integer.See sklearn for details.
The idea is to start with a tiny training dataset (probably only one item). Then the model should be able to fit the data perfectly. If this works, you make a slightly larger dataset. Your training error should slightly go up at some point. This reveals your models capacity to model the data.
Check how often the other class(es) appear. If one class dominates the others (e.g. one class is 99.9% of the data), this is a problem. Look for "outlier detection" techniques.
0.001
is often used / working. This is also relevant if you use Adam as an optimizer.This is inspired by reddit:
imbalanced-learn
Upvotes: 148
Reputation: 1362
Same happened to me. I had an imbalanced dataset (about 66%-33% sample distribution between classes 0 and 1, respectively) and the net was always outputting 0.0
for all samples after the first iteration.
My problem was simply a too high learning rate. Switching it to 1e-05
solved the issue.
More generally, what I suggest to do is to print, before the parameters' update:
And then check the same three items after the parameter update. What you should see in the next batch is a gradual change in the net output. When my learning rate was too high, already in the second iteration the net output would shoot to either all 1.0
s or all 0.0
s for all samples in the batch.
Upvotes: 20
Reputation: 1421
After a week and a half of research I think I understand what the issue is. There is nothing wrong with the code itself. The only two issues that prevent my implementation from classifying successfully are time spent learning and proper selection of learning rate / regularization parameters.
I've had the learning routine running for some tome now, and it's pushing 75% accuracy already, though there is still plenty of space for improvement.
Upvotes: 32