Reputation: 11
Please note: I trained my model only on the numerical columns, not the string columns.
Also, please suggest some resources to go deeper into machine learning, as I really like this subject.
Thank you.
Here is the code; it gives the following output:
train rmse: 0.42 test rmse: 0.43
```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
dftest = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv')

# Replace fares of 0 with a substitute value
dftrain.loc[dftrain['fare'] == 0, 'fare'] = 34.85
plt.plot(list(dftrain.age), list(dftrain.fare), '.', markersize=1)

# Keep only the numerical columns
dftrain = dftrain.drop(['sex', 'class', 'deck', 'embark_town', 'alone'], axis=1)
X = dftrain.loc[:, dftrain.columns != 'survived']
y = dftrain.loc[:, 'survived']

# Four numerical features in, one survival probability out
model = Sequential()
model.add(Dense(128, activation='relu', input_dim=4))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=200)

dftest = dftest.drop(['sex', 'class', 'deck', 'embark_town', 'alone'], axis=1)
A = dftest.loc[:, dftest.columns != 'survived']
b = dftest.loc[:, 'survived']

train_pred = model.predict(X)
train_rmse = np.sqrt(mean_squared_error(y, train_pred))
test_pred = model.predict(A)
test_rmse = np.sqrt(mean_squared_error(b, test_pred))
print("train rmse: {:0.2f}".format(train_rmse))
print("test rmse: {:0.2f}".format(test_rmse))
```
Upvotes: 1
Views: 199
Reputation: 11
First, root mean square error may not be a good metric for a classification problem in the first place: RMSE is a regression metric, while your network is actually trained on binary cross-entropy. For reasons why, refer to either this post or this Stats Stack Exchange post.
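For instance, here is a minimal sketch (reusing `X`, `y`, and the trained `model` from the question) that reports log loss, which is the quantity your `binary_crossentropy` objective actually optimizes, alongside RMSE:

```python
import numpy as np
from sklearn.metrics import log_loss, mean_squared_error

train_prob = model.predict(X).ravel()  # sigmoid outputs in [0, 1]

# RMSE treats the 0/1 labels as regression targets...
train_rmse = np.sqrt(mean_squared_error(y, train_prob))
# ...whereas log loss scores the predicted probabilities directly
train_logloss = log_loss(y, train_prob)

print("train rmse: {:0.2f}, train log loss: {:0.2f}".format(train_rmse, train_logloss))
```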
Second, you're training a fairly large neural network (this architecture has roughly 11,000 trainable parameters) compared to the amount of training data available (there were only 2224 passengers and crew members in total, and the training CSV holds just a fraction of them). When the number of parameters in your model is comparable to the number of training examples, you run a real risk of overfitting. Refer to this tutorial to learn what training/validation loss curves can tell you about your model and how you can combat over- and underfitting. You can experiment with different learning rates, numbers of epochs, batch sizes, normalization methods, etc.
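As a minimal sketch (reusing `X` and `y` from the question, and assuming a freshly built, identically compiled `model`), you could hold out a validation split, stop training once the validation loss stops improving, and plot both loss curves:

```python
import matplotlib.pyplot as plt
from tensorflow.keras.callbacks import EarlyStopping

# Stop once validation loss has not improved for 10 epochs,
# rolling back to the best weights seen so far
early_stop = EarlyStopping(monitor='val_loss', patience=10,
                           restore_best_weights=True)

history = model.fit(X, y, epochs=200, batch_size=32,
                    validation_split=0.2, callbacks=[early_stop])

# Training loss falling while validation loss rises is the
# classic signature of overfitting
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.legend()
plt.show()
```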
You might also want to take a look at other metrics, such as accuracy, precision, and recall.
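A rough sketch of how to compute them with scikit-learn (reusing `A`, `b`, and the trained `model` from the question; the 0.5 cutoff is an assumption, though a common one for sigmoid outputs):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Turn predicted probabilities into hard 0/1 labels at a 0.5 threshold
test_prob = model.predict(A).ravel()
test_labels = (test_prob > 0.5).astype(int)

print("accuracy: {:0.2f}".format(accuracy_score(b, test_labels)))
print("precision: {:0.2f}".format(precision_score(b, test_labels)))
print("recall: {:0.2f}".format(recall_score(b, test_labels)))
```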
Upvotes: 1