Reputation: 3
I am implementing logistic regression with gradient descent from scratch in Python, working on the breast cancer dataset. While calculating the cost, I get only nan values. I tried standardizing my data and decreasing my alpha value, but neither had any effect. Despite this, I get 95.8% accuracy, which feels wrong. Given below is some of my code:
def hypothesis(b, X):
    z = np.dot(X, b)
    #print(z)
    return sigmoid(z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def FindCost(h, y):
    r = y.shape[0]
    cost = np.sum(y*np.log(h) + (1-y)*np.log(1-h)) / r
    #print(cost)
    return -cost
def gradient_descent(X, y, alpha, epoch):
    r = X.shape[0]
    c = X.shape[1]
    theta = np.ones((c, 1))
    min_cost = None
    min_theta = []
    Cost_list = []
    for i in range(epoch):
        h = hypothesis(theta, X)
        grad = np.dot(X.T, (h - y))
        theta = theta - alpha*grad
        cost = FindCost(h, y)
        Cost_list.append(cost)
        if min_cost is None or min_cost > cost:
            min_cost = cost
            min_theta = list(theta)
    return min_theta, Cost_list
def calAccuracy(theta, X, y):
    h = hypothesis(theta, X)
    correct = 0
    for i in range(y.shape[0]):
        if h[i] >= 0.5:
            if y[i] == 1: correct += 1
            print("predicted: ", 1, end='\t\t')
        elif h[i] < 0.5:
            if y[i] == 0: correct += 1
            print("predicted: ", 0, end='\t\t')
        print("actual: ", y[i])
    return correct*100 / y.shape[0]
alpha = 0.01
epoch = 1000
theta,cost = gradient_descent(x_train,y_train,alpha,epoch)
accuracy = calAccuracy(theta,x_test,y_test)
print(f"the accuracy of the model: {accuracy} %")
Output:
the accuracy of the model: 95.8041958041958 %
standardization of dataset:
for i in range(x_train.shape[1]):
    x_train[i] = (x_train[i]-np.mean(x_train[i]))/np.std(x_train[i])
    x_test[i] = (x_test[i]-np.mean(x_test[i]))/np.std(x_test[i])
I concatenate a column of ones (the first column) to my x_train and x_test after standardizing the dataset.
x train data:
[[ 1.00000000e+00 -6.51198873e-01 -5.29762615e-01 4.19236602e-02
1.91948751e+00 -7.80449683e-01]
[ 1.00000000e+00 -6.85821055e-01 -4.00751146e-01 -3.29500919e-02
1.92747771e+00 -8.07955419e-01]
[ 1.00000000e+00 -6.76114725e-01 -4.04963490e-01 -6.16161982e-02
1.93556890e+00 -7.92874483e-01]
y train data:
[[1.]
[1.]
[1.]
x test data:
[[ 1.00000000e+00 -5.63066669e-01 -5.36144255e-01 -2.71074811e-01
1.98575469e+00 -6.15468953e-01]
[ 1.00000000e+00 -5.57037602e-01 -5.57366708e-01 -2.60414280e-01
1.98474749e+00 -6.09928901e-01]
[ 1.00000000e+00 -5.56192661e-01 -5.57892986e-01 -2.62657675e-01
1.98504143e+00 -6.08298112e-01]
y test data:
[[0.]
[1.]
[0.]
What am I doing wrong here, and how do I prevent the nan values from showing up? Also, if my cost function is giving me nan values, how am I getting such high accuracy?
P.S. My dataset had no null values initially, and I converted it into a numpy array.
Upvotes: 0
Views: 409
Reputation: 854
So the problem was:
for i in range(x_train.shape[1]):
    x_train[i] = (x_train[i]-np.mean(x_train[i]))/np.std(x_train[i])
    x_test[i] = (x_test[i]-np.mean(x_test[i]))/np.std(x_test[i])
The standardization was not done feature-wise / column-wise: x_train[i] selects row i of the array, not column i, so the loop over x_train.shape[1] only standardized the first few rows and left the features at their original, large scales, which saturates the sigmoid and makes np.log(0) produce nan. Index the columns instead:
for i in range(x_train.shape[1]):
    x_train[:,i] = (x_train[:,i]-np.mean(x_train[:,i]))/np.std(x_train[:,i])
    x_test[:,i] = (x_test[:,i]-np.mean(x_test[:,i]))/np.std(x_test[:,i])
(Ideally the test set would be standardized with the training set's mean and std, but the column-wise fix alone removes the nan values.)
This gives finite cost values with the same accuracy: 95.8041958041958 %. (The accuracy was unaffected before because h = sigmoid(z) itself never becomes nan; the saturated values h = 0 and h = 1 only break the cost through np.log(0), while the h >= 0.5 comparison still works.)
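As a minimal, self-contained sketch (on hypothetical synthetic data, not the asker's dataset), this shows the column-wise standardization and, as an extra safeguard, a clipped cross-entropy that stays finite even when the sigmoid saturates to exactly 0 or 1:

```python
import numpy as np

# Hypothetical small dataset: rows are samples, columns are features
# with deliberately different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, 100.0], scale=[1.0, 20.0], size=(8, 2))

# Column-wise standardization: axis=0 reduces over rows,
# giving one mean and one std per feature.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_std = (X - mu) / sigma

print(X_std.mean(axis=0))  # each column's mean is ~0
print(X_std.std(axis=0))   # each column's std is ~1

# Numerically stable cross-entropy: clip h away from exactly 0 and 1
# so np.log never receives 0 and the cost cannot become nan.
def stable_cost(h, y, eps=1e-12):
    h = np.clip(h, eps, 1 - eps)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

h = np.array([[1.0], [0.0], [0.7]])  # saturated predictions included
y = np.array([[1.0], [0.0], [1.0]])
print(stable_cost(h, y))  # finite, no nan
```

With the unclipped FindCost from the question, the same saturated h would yield nan via 0 * log(0); clipping trades that for a tiny, bounded error in the reported cost.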
Upvotes: 0