aarav singh luthra

Reputation: 11

I just trained my first ML model on the Titanic dataset from Kaggle. I am getting an RMSE of about 0.4. Is that good?

Please note: I trained the model only on the numerical columns, not on the string columns.

Please also suggest some resources for going further into machine learning, as I really like this subject.

Thank you

Here is the code, which gives the following output:

train rmse: 0.42 test rmse: 0.43

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
dftest = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv')

# Replace zero fares with a fixed value, then plot fare against age
dftrain.loc[dftrain['fare'] == 0, 'fare'] = 34.85
plt.plot(dftrain['age'], dftrain['fare'], '.', markersize=1)

# Keep only the numerical columns; 'survived' is the target
string_columns = ['sex', 'class', 'deck', 'embark_town', 'alone']
dftrain = dftrain.drop(string_columns, axis=1)
X = dftrain.loc[:, dftrain.columns != 'survived']
y = dftrain.loc[:, 'survived']

# Fully connected binary classifier on the 4 remaining features
model = Sequential()
model.add(Dense(128, activation='relu', input_dim=4))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=200)

# Prepare the evaluation set the same way
dftest = dftest.drop(string_columns, axis=1)
A = dftest.loc[:, dftest.columns != 'survived']
b = dftest.loc[:, 'survived']

# RMSE between the 0/1 labels and the predicted probabilities
train_pred = model.predict(X)
train_rmse = np.sqrt(mean_squared_error(y, train_pred))
test_pred = model.predict(A)
test_rmse = np.sqrt(mean_squared_error(b, test_pred))

print("train rmse: {:0.2f}".format(train_rmse))
print("test rmse: {:0.2f}".format(test_rmse))
```




Upvotes: 1

Views: 199

Answers (1)

MichalW

Reputation: 11

First, root mean square error is usually not the best score to look at for a classification problem. For reasons why, refer to either this post or this Stats Stack Exchange post.
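To put an RMSE of about 0.4 in context, it can help to compare it against a trivial baseline that predicts the same probability for everyone. The sketch below assumes 0/1 labels and a survival rate of roughly 38% (an assumed figure; check it with `y.mean()` on your own data). `baseline_rmse` is a hypothetical helper written for this illustration, not a library function.

```python
import math


def baseline_rmse(positive_rate):
    """RMSE of always predicting `positive_rate` for 0/1 labels.

    A fraction p of the labels is 1 (squared error (1 - p)^2) and a
    fraction 1 - p is 0 (squared error p^2), so the mean squared error
    is p * (1 - p)^2 + (1 - p) * p^2 = p * (1 - p).
    """
    return math.sqrt(positive_rate * (1.0 - positive_rate))


# Assumed survival rate; replace with y.mean() from the real data.
assumed_rate = 0.38
print("baseline rmse: {:0.2f}".format(baseline_rmse(assumed_rate)))  # baseline rmse: 0.49
```

Under that assumption, a model that ignores its inputs entirely already reaches an RMSE of about 0.49, so 0.42 to 0.43 is a modest improvement rather than a strong result.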

Second, you're training a fairly large neural network relative to the amount of training data. The layers in your code add up to roughly 11,000 trainable parameters, while the Titanic carried only about 2,224 passengers and crew in total, and the training file covers only a fraction of them. When a model has as many parameters as training examples, or more, you run a risk of overfitting. Refer to this tutorial to learn what the training and validation loss curves can tell you about your model and how to combat overfitting and underfitting. You can experiment with different learning rates, numbers of epochs, batch sizes, normalization methods, etc.
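The size of the network can be checked by hand. This is a minimal sketch that counts the weights and biases of the fully connected layers from the question (4 inputs, then 128, 64, 32 and 1 units); `dense_param_count` is a hypothetical helper for illustration.

```python
def dense_param_count(input_dim, layer_units):
    """Count the weights and biases of a stack of fully connected layers."""
    total = 0
    fan_in = input_dim
    for units in layer_units:
        # One weight per (input, unit) pair plus one bias per unit
        total += fan_in * units + units
        fan_in = units
    return total


# Layer sizes taken from the question: 4 inputs -> 128 -> 64 -> 32 -> 1
n_params = dense_param_count(4, [128, 64, 32, 1])
print(n_params)  # 11009
```

Calling `model.summary()` on the Keras model should report the same total, which you can then compare with `len(X)`.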

You might also want to look at other metrics, such as accuracy, precision, and recall.
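As a rough sketch of how those metrics are computed, the snippet below thresholds predicted probabilities at 0.5 and counts the outcomes. The labels and probabilities are made up for illustration, and `classification_metrics` is a hypothetical helper; with the model from the question you would pass `b` and `model.predict(A).ravel()` instead. scikit-learn's `accuracy_score`, `precision_score` and `recall_score` give the same numbers once the probabilities have been thresholded.

```python
def classification_metrics(y_true, y_prob, threshold=0.5):
    """Return (accuracy, precision, recall) for 0/1 labels and probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up labels and predicted probabilities, for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]
acc, prec, rec = classification_metrics(y_true, y_prob)
print("accuracy: {:0.2f} precision: {:0.2f} recall: {:0.2f}".format(acc, prec, rec))
```

Precision answers "of the passengers predicted to survive, how many did?", while recall answers "of the passengers who survived, how many were found?".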

Upvotes: 1
