Reputation: 21
I've built a model to predict loan suitability on a Kaggle dataset here:
# imports (assuming the standalone keras package; use tensorflow.keras if that is your install)
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dense

# df is the Kaggle dataframe loaded earlier; split into 11 feature columns and the target
dataset = df.values
X = dataset[:, 0:11].astype(float)
Y = dataset[:, 11]

# standardize each feature column to zero mean and unit variance
scaler = StandardScaler()
X = scaler.fit_transform(X)

# simple binary classifier
model = Sequential()
model.add(Dense(5, input_dim=11, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, Y, epochs=150, batch_size=10, verbose=0)

scores = model.evaluate(X, Y, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))
model.save("model.h5")
This model gives an accuracy of 81.43%. The problem arises when I try to make a prediction with this model. Here I've passed the third row of the dataset to the model as an array, and the predicted probability, as it is for the other rows I've tried, is zero.
# imports for the prediction script
import numpy as np
from keras.models import load_model
from sklearn.preprocessing import StandardScaler

model = load_model('model.h5')

# third row of the dataset, passed as a single sample
X = np.array([[0, 1, 0, 0, 1, 3000, 0, 66, 360, 1, 0]], dtype=np.float32)
scaler = StandardScaler()
X = scaler.fit_transform(X)
X = scaler.transform(X.reshape(1, -1))

pred = model.predict(X)
print(X)
print("Probability that eligibility = 1:")
print(pred)
I get the output:
[[ 0.000e+00 -1.000e+00 -1.000e+00 0.000e+00 0.000e+00 -4.583e+03
-1.508e+03 -1.280e+02 -3.600e+02 -1.000e+00 -1.000e+00]]
Probability that eligibility = 1:
[[0.]]
I have not been able to find a solution here on Stack Overflow or on other sites.
Upvotes: 0
Views: 380
Reputation: 2086
Do not fit a new StandardScaler on the new data. You need to save the StandardScaler you fitted on the training data in addition to your model, then load it and use it to transform your new data.
Save it:
from pickle import dump

# fit on the training data, then persist the fitted scaler alongside the model
scaler = StandardScaler()
X = scaler.fit_transform(X)
dump(scaler, open('scaler.pkl', 'wb'))
Then load it when you want to predict:
from pickle import load

# reuse the scaler that was fitted on the training data
scaler = load(open('scaler.pkl', 'rb'))
X = np.array([[0, 1, 0, 0, 1, 3000, 0, 66, 360, 1, 0]], dtype=np.float32)
X = scaler.transform(X)
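Tying the two snippets together, a minimal sketch of the full prediction step (the file names model.h5 and scaler.pkl are the ones used above):
from pickle import load

import numpy as np
from keras.models import load_model

# load the trained model and the scaler fitted on the training data
model = load_model('model.h5')
scaler = load(open('scaler.pkl', 'rb'))

# new sample with the same 11 features, in the same column order as training
X_new = np.array([[0, 1, 0, 0, 1, 3000, 0, 66, 360, 1, 0]], dtype=np.float32)
X_new = scaler.transform(X_new)  # reuse training statistics; no fit_transform here

pred = model.predict(X_new)
print("Probability that eligibility = 1:", pred[0][0])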
Upvotes: 1
Reputation: 1020
You're performing standardization for the training part, which is great. However, you're predicting with values that are mis-standardized. When you standardize the training data, you compute the mean and std of each column and apply the operation column by column.
However, the prediction part is wrong because you compute the mean and std of a single row instead.
The correct training process is:
X_standard = (X - mean_column) / std_column
The correct prediction process is:
X_new_standard = (X_new - mean_column) / std_column
where mean_column and std_column are the per-column statistics computed on the training data, not on the new sample.
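A small numpy/sklearn sketch of the difference (the toy feature values are made up for illustration):
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy training matrix: 3 samples, 2 features
X_train = np.array([[1.0, 100.0],
                    [2.0, 200.0],
                    [3.0, 300.0]])

scaler = StandardScaler().fit(X_train)   # learns the per-column mean and std

x_new = np.array([[2.0, 150.0]])

# correct: reuse the statistics learned from the training columns
print(scaler.transform(x_new))                 # approx. [[ 0.   -0.61]]

# wrong: fitting on the single row collapses every feature to 0
print(StandardScaler().fit_transform(x_new))   # [[0. 0.]]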
Upvotes: 0