Reputation: 3
I am implementing logistic regression with gradient descent from scratch in Python, working on the breast cancer dataset. While calculating the cost, I get only nan values. I tried standardizing my data and decreasing my alpha value, but neither had any effect. Despite this, I get 95.8% accuracy, which feels wrong. Given below is some of my code:
def hypothesis(b, X):
    z = np.dot(X, b)
    #print(z)
    return sigmoid(z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def FindCost(h, y):
    r = y.shape[0]
    cost = np.sum(y*np.log(h) + (1-y)*np.log(1-h)) / r
    #print(cost)
    return -cost
def gradient_descent(X, y, alpha, epoch):
    r = X.shape[0]
    c = X.shape[1]
    theta = np.ones((c, 1))
    min_cost = None
    min_theta = []
    Cost_list = []
    for i in range(epoch):
        h = hypothesis(theta, X)
        grad = np.dot(X.T, (h - y))
        theta = theta - alpha*grad
        cost = FindCost(h, y)
        Cost_list.append(cost)
        if min_cost is None or min_cost > cost:
            min_cost = cost
            min_theta = list(theta)
    return min_theta, Cost_list
def calAccuracy(theta, X, y):
    h = hypothesis(theta, X)
    correct = 0
    for i in range(y.shape[0]):
        if h[i] >= 0.5:
            if y[i] == 1: correct += 1
            print("predicted: ", 1, end='\t\t')
        elif h[i] < 0.5:
            if y[i] == 0: correct += 1
            print("predicted: ", 0, end='\t\t')
        print("actual: ", y[i])
    return correct*100 / y.shape[0]
alpha = 0.01
epoch = 1000
theta,cost = gradient_descent(x_train,y_train,alpha,epoch)
accuracy = calAccuracy(theta,x_test,y_test)
print(f"the accuracy of the model: {accuracy} %")
Output:
the accuracy of the model: 95.8041958041958 %
standardization of dataset:
for i in range(x_train.shape[1]):
    x_train[i] = (x_train[i]-np.mean(x_train[i]))/np.std(x_train[i])
    x_test[i] = (x_test[i]-np.mean(x_test[i]))/np.std(x_test[i])
I concatenate a column of ones (the first column) to my x_train and x_test after standardizing the dataset.
x train data:
[[ 1.00000000e+00 -6.51198873e-01 -5.29762615e-01 4.19236602e-02
1.91948751e+00 -7.80449683e-01]
[ 1.00000000e+00 -6.85821055e-01 -4.00751146e-01 -3.29500919e-02
1.92747771e+00 -8.07955419e-01]
[ 1.00000000e+00 -6.76114725e-01 -4.04963490e-01 -6.16161982e-02
1.93556890e+00 -7.92874483e-01]
y train data:
[[1.]
[1.]
[1.]
x test data:
[[ 1.00000000e+00 -5.63066669e-01 -5.36144255e-01 -2.71074811e-01
1.98575469e+00 -6.15468953e-01]
[ 1.00000000e+00 -5.57037602e-01 -5.57366708e-01 -2.60414280e-01
1.98474749e+00 -6.09928901e-01]
[ 1.00000000e+00 -5.56192661e-01 -5.57892986e-01 -2.62657675e-01
1.98504143e+00 -6.08298112e-01]
y test data:
[[0.]
[1.]
[0.]
What am I doing wrong here, and how do I prevent the nan values from showing up? Also, if my cost function is giving me nan values, how am I getting such high accuracy?
P.S. My dataset had no null values initially, and I converted it into a numpy array.
Upvotes: 0
Views: 409
Reputation: 854
So the problem was:
for i in range(x_train.shape[1]):
    x_train[i] = (x_train[i]-np.mean(x_train[i]))/np.std(x_train[i])
    x_test[i] = (x_test[i]-np.mean(x_test[i]))/np.std(x_test[i])
The standardization was not done feature-wise / column-wise: x_train[i] selects row i of the array, not column i, so the loop over x_train.shape[1] only standardized the first few rows and left the features at their original, large scales, which saturates the sigmoid and makes np.log(0) produce nan. Index the columns instead:
for i in range(x_train.shape[1]):
    x_train[:,i] = (x_train[:,i]-np.mean(x_train[:,i]))/np.std(x_train[:,i])
    x_test[:,i] = (x_test[:,i]-np.mean(x_test[:,i]))/np.std(x_test[:,i])
(Ideally the test set would be standardized with the training set's mean and std, but the column-wise fix alone removes the nan values.)
This gives finite cost values with the same accuracy: 95.8041958041958 %. (The accuracy was unaffected before because h = sigmoid(z) itself never becomes nan; the saturated values h = 0 and h = 1 only break the cost through np.log(0), while the h >= 0.5 comparison still works.)
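As a minimal, self-contained sketch (on hypothetical synthetic data, not the asker's dataset), this shows the column-wise standardization and, as an extra safeguard, a clipped cross-entropy that stays finite even when the sigmoid saturates to exactly 0 or 1:

```python
import numpy as np

# Hypothetical small dataset: rows are samples, columns are features
# with deliberately different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, 100.0], scale=[1.0, 20.0], size=(8, 2))

# Column-wise standardization: axis=0 reduces over rows,
# giving one mean and one std per feature.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_std = (X - mu) / sigma

print(X_std.mean(axis=0))  # each column's mean is ~0
print(X_std.std(axis=0))   # each column's std is ~1

# Numerically stable cross-entropy: clip h away from exactly 0 and 1
# so np.log never receives 0 and the cost cannot become nan.
def stable_cost(h, y, eps=1e-12):
    h = np.clip(h, eps, 1 - eps)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

h = np.array([[1.0], [0.0], [0.7]])  # saturated predictions included
y = np.array([[1.0], [0.0], [1.0]])
print(stable_cost(h, y))  # finite, no nan
```

With the unclipped FindCost from the question, the same saturated h would yield nan via 0 * log(0); clipping trades that for a tiny, bounded error in the reported cost.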
Upvotes: 0