Reputation: 3206
I am trying to train a model to predict car prices. The dataset is from kaggle: https://www.kaggle.com/vfsousas/autos#autos.csv
I am preparing the data with the following code:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

class CarDataset(DataSet):  # DataSet is a base class from my own code
    def __init__(self, csv_file):
        # drop columns that should carry no pricing signal
        df = pd.read_csv(csv_file).drop(["dateCrawled", "name", "abtest", "dateCreated",
                                         "nrOfPictures", "postalCode", "lastSeen"], axis = 1)

        # keep only private sellers and actual offers
        df = df.drop(df[df["seller"] == "gewerblich"].index).drop(["seller"], axis = 1)
        df = df.drop(df[df["offerType"] == "Gesuch"].index).drop(["offerType"], axis = 1)

        # drop rows with missing values in the relevant columns
        df = df[df["vehicleType"].notnull()]
        df = df[df["notRepairedDamage"].notnull()]
        df = df[df["model"].notnull()]
        df = df[df["fuelType"].notnull()]

        # filter out outliers
        df = df[(df["price"] > 100) & (df["price"] < 100000)]
        df = df[(df["monthOfRegistration"] > 0) & (df["monthOfRegistration"] < 13)]
        df = df[(df["yearOfRegistration"] < 2019) & (df["yearOfRegistration"] > 1950)]
        df = df[(df["powerPS"] > 20) & (df["powerPS"] < 550)]

        # binary flags; note that "automatic" is set to 1 for "manuell" (manual gearbox)
        df["hasDamage"] = np.where(df["notRepairedDamage"] == "ja", 1, 0)
        df["automatic"] = np.where(df["gearbox"] == "manuell", 1, 0)
        df["fuel"] = np.where(df["fuelType"] == "benzin", 0, 1)

        # age of the car in months
        df["age"] = (2019 - df["yearOfRegistration"]) * 12 + df["monthOfRegistration"]

        df = df.drop(["notRepairedDamage", "gearbox", "fuelType",
                      "yearOfRegistration", "monthOfRegistration"], axis = 1)

        # one-hot encode the remaining categorical columns
        df = pd.get_dummies(df, columns = ["vehicleType", "model", "brand"])

        self.df = df
        self.Y = self.df["price"].values
        self.X = self.df.drop(["price"], axis = 1).values

        scaler = StandardScaler()
        scaler.fit(self.X)
        self.X = scaler.transform(self.X)

        self.x_train, self.x_test, self.y_train, self.y_test = train_test_split(
            self.X, self.Y, test_size = 0.25, random_state = 0)
        self.x_train, self.x_valid, self.y_train, self.y_valid = train_test_split(
            self.x_train, self.y_train, test_size = 0.25, random_state = 0)

    def get_input_shape(self):
        return (len(self.df.columns) - 1, )  # (303, )
This results in the following prepared dataset:
price powerPS kilometer hasDamage automatic fuel age vehicleType_andere vehicleType_bus vehicleType_cabrio vehicleType_coupe ... brand_rover brand_saab brand_seat brand_skoda brand_smart brand_subaru brand_suzuki brand_toyota brand_trabant brand_volkswagen brand_volvo
3 1500 75 150000 0 1 0 222 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1 0
4 3600 69 90000 0 1 1 139 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0 0
5 650 102 150000 1 1 0 298 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0
6 2200 109 150000 0 1 0 188 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0 0
10 2000 105 150000 0 1 0 192 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0
[5 rows x 304 columns]
hasDamage is a flag (0 or 1) indicating whether or not the car has non-repaired damage.
automatic is a flag (0 or 1) indicating whether the car has manual or automatic gear shifting.
fuel is 0 for diesel and 1 for gas.
age is the age of the car in months.
The columns brand, model and vehicleType will be one-hot encoded using df = pd.get_dummies(df, columns = ["vehicleType", "model", "brand"]). Also, I am going to use a StandardScaler to transform the X values.
The dataset now contains 303 columns for X, plus of course Y being the "price" column.
With this dataset, regular LinearRegression will achieve a score of ~0.7 on the training and test set.
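The LinearRegression fit itself is not shown here; a minimal sketch of how such a baseline could look (the dataset variable and the CSV file name are my assumptions, not taken from the question):

from sklearn.linear_model import LinearRegression

dataset = CarDataset("autos.csv")  # hypothetical file name; adjust to your local copy
reg = LinearRegression().fit(dataset.x_train, dataset.y_train)
print(reg.score(dataset.x_train, dataset.y_train))  # ~0.7
print(reg.score(dataset.x_test, dataset.y_test))    # ~0.7; .score() returns the r2-score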
Now I have tried a deep learning approach using Keras, but no matter what I do, the mse and loss go through the roof, and the model does not seem to be capable of learning anything:
from keras.layers import Input, Dense, Activation
from keras.models import Model
from keras.optimizers import Adam
from keras.callbacks import ReduceLROnPlateau, EarlyStopping

learning_rate = 3e-4
optimizer = Adam  # placeholder; any optimizer, run with lr = 3e-4 as stated below

input_tensor = model_stack = Input(dataset.get_input_shape())  # (303, )
model_stack = Dense(20)(model_stack)
model_stack = Activation("relu", name = "relu_1")(model_stack)
model_stack = Dense(20)(model_stack)
model_stack = Activation("relu", name = "relu_2")(model_stack)
model_stack = Dense(1, name = "Output")(model_stack)

model = Model(inputs = [input_tensor], outputs = [model_stack])
model.compile(loss = "mse", optimizer = optimizer(lr = learning_rate), metrics = ['mse'])
model.summary()

callbacks = []
callbacks.append(ReduceLROnPlateau(monitor = "val_loss", factor = 0.95, verbose = 1, patience = 1))
callbacks.append(EarlyStopping(monitor = 'val_loss', patience = 5, min_delta = 0.01,
                               restore_best_weights = True, verbose = 1))

model.fit(x = dataset.x_train,
          y = dataset.y_train,
          verbose = 1,
          batch_size = 128,
          epochs = 200,
          validation_data = (dataset.x_valid, dataset.y_valid),
          callbacks = callbacks)

score = model.evaluate(dataset.x_test, dataset.y_test, verbose = 1)
print("Model score: {}".format(score))
And the summary/training looks like this (learning rate is 3e-4):
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 6) 0
_________________________________________________________________
dense_1 (Dense) (None, 20) 140
_________________________________________________________________
relu_1 (Activation) (None, 20) 0
_________________________________________________________________
dense_2 (Dense) (None, 20) 420
_________________________________________________________________
relu_2 (Activation) (None, 20) 0
_________________________________________________________________
Output (Dense) (None, 1) 21
=================================================================
Total params: 581
Trainable params: 581
Non-trainable params: 0
_________________________________________________________________
Train on 182557 samples, validate on 60853 samples
Epoch 1/200
182557/182557 [==============================] - 2s 13us/step - loss: 110046953.4602 - mean_squared_error: 110046953.4602 - acc: 0.0000e+00 - val_loss: 107416331.4062 - val_mean_squared_error: 107416331.4062 - val_acc: 0.0000e+00
Epoch 2/200
182557/182557 [==============================] - 2s 11us/step - loss: 97859920.3050 - mean_squared_error: 97859920.3050 - acc: 0.0000e+00 - val_loss: 85956634.8803 - val_mean_squared_error: 85956634.8803 - val_acc: 1.6433e-05
Epoch 3/200
182557/182557 [==============================] - 2s 12us/step - loss: 70531052.0493 - mean_squared_error: 70531052.0493 - acc: 2.1911e-05 - val_loss: 54933938.6787 - val_mean_squared_error: 54933938.6787 - val_acc: 3.2866e-05
Epoch 4/200
182557/182557 [==============================] - 2s 11us/step - loss: 42639802.3204 - mean_squared_error: 42639802.3204 - acc: 3.2866e-05 - val_loss: 32645940.6536 - val_mean_squared_error: 32645940.6536 - val_acc: 1.3146e-04
Epoch 5/200
182557/182557 [==============================] - 2s 11us/step - loss: 28282909.0699 - mean_squared_error: 28282909.0699 - acc: 1.4242e-04 - val_loss: 25315220.7446 - val_mean_squared_error: 25315220.7446 - val_acc: 9.8598e-05
Epoch 6/200
182557/182557 [==============================] - 2s 11us/step - loss: 24279169.5270 - mean_squared_error: 24279169.5270 - acc: 3.8344e-05 - val_loss: 23420569.2554 - val_mean_squared_error: 23420569.2554 - val_acc: 9.8598e-05
Epoch 7/200
182557/182557 [==============================] - 2s 11us/step - loss: 22874003.0459 - mean_squared_error: 22874003.0459 - acc: 9.8599e-05 - val_loss: 22380401.0622 - val_mean_squared_error: 22380401.0622 - val_acc: 1.6433e-05
...
Epoch 197/200
182557/182557 [==============================] - 2s 12us/step - loss: 13828827.1595 - mean_squared_error: 13828827.1595 - acc: 3.3414e-04 - val_loss: 14123447.1746 - val_mean_squared_error: 14123447.1746 - val_acc: 3.1223e-04
Epoch 00197: ReduceLROnPlateau reducing learning rate to 0.00020950120233464986.
Epoch 198/200
182557/182557 [==============================] - 2s 13us/step - loss: 13827193.5994 - mean_squared_error: 13827193.5994 - acc: 2.4102e-04 - val_loss: 14116898.8054 - val_mean_squared_error: 14116898.8054 - val_acc: 1.6433e-04
Epoch 00198: ReduceLROnPlateau reducing learning rate to 0.00019902614221791736.
Epoch 199/200
182557/182557 [==============================] - 2s 12us/step - loss: 13823582.4300 - mean_squared_error: 13823582.4300 - acc: 3.3962e-04 - val_loss: 14108715.5067 - val_mean_squared_error: 14108715.5067 - val_acc: 4.1083e-04
Epoch 200/200
182557/182557 [==============================] - 2s 11us/step - loss: 13820568.7721 - mean_squared_error: 13820568.7721 - acc: 3.1223e-04 - val_loss: 14106001.7681 - val_mean_squared_error: 14106001.7681 - val_acc: 2.3006e-04
60853/60853 [==============================] - 1s 18us/step
Model score: [14106001.790199332, 14106001.790199332, 0.00023006260989597883]
I am still a beginner in machine learning. Is there any big/obvious mistake in my approach? What am I doing wrong?
Upvotes: 3
Views: 739
Reputation: 3206
So, after a while I found the Kaggle link to the correct dataset. I was using https://www.kaggle.com/vfsousas/autos first; however, the same data is also available at https://www.kaggle.com/orgesleka/used-cars-database, together with 222 kernels to look at. Looking at https://www.kaggle.com/themanchanda/neural-network-approach showed that this author was also getting "big numbers" for the loss, which was the main part of my confusion (as I had so far only dealt with "smaller numbers" or accuracies) and made me think again.
Then it got pretty clear to me: I had been comparing the network's mse loss directly to the score of sklearn's LinearRegression, which were not comparable anyway. In a nutshell: the score of sklearn's LinearRegression is the r2-score. So I changed the metrics to "mae" and a custom r2-implementation, so that I can compare it to the LinearRegression.
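The custom r2-implementation is not shown here; a minimal sketch of how such a Keras metric could look (the function name is my own; optimizer and learning rate mirror the question's setup):

import keras.backend as K

def r2(y_true, y_pred):
    # r2 = 1 - SS_res / SS_tot, computed per batch
    ss_res = K.sum(K.square(y_true - y_pred))
    ss_tot = K.sum(K.square(y_true - K.mean(y_true)))
    return 1 - ss_res / (ss_tot + K.epsilon())

model.compile(loss = "mse", optimizer = Adam(lr = 3e-4), metrics = ["mae", r2])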
It turned out that after around 100 epochs, on the first try, I ended up at an MAE of 1900 and an r2-score of 0.69.
Then I also calculated the MAE for the LinearRegression for comparison purposes, and it evaluated to 2855.417 (with an r2-score of 0.67).
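For reference, a sketch of how those comparison numbers could be computed with sklearn's metrics (reusing the hypothetical reg baseline sketched in the question section):

from sklearn.metrics import mean_absolute_error, r2_score

y_pred = reg.predict(dataset.x_test)
print(mean_absolute_error(dataset.y_test, y_pred))  # 2855.417 in my run
print(r2_score(dataset.y_test, y_pred))             # 0.67 in my run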
So in fact the deep learning approach was already better with regard to both the MAE and the r2-score. Thus, nothing was wrong, and I can go on and tune/optimize the model now :)
Upvotes: 1
Reputation: 161
Your model seems to be underfitting.
Try adding more neurons, as already suggested, and also try increasing the number of layers.
Try using sigmoid as your activation function.
Try increasing your learning rate. You can also switch between the Adam and SGD optimizers, as sketched below.
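For instance, swapping the optimizer at compile time might look like this (a sketch against the question's model; the concrete learning rates are illustrative, not prescribed by this answer):

from keras.optimizers import Adam, SGD

# a larger learning rate with Adam ...
model.compile(loss = "mse", optimizer = Adam(lr = 1e-3), metrics = ["mae"])

# ... or SGD with momentum instead
model.compile(loss = "mse", optimizer = SGD(lr = 1e-2, momentum = 0.9), metrics = ["mae"])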
Model fitting from scratch is always trial and error. Try changing one parameter at a time, then two together, and so on. Moreover, I would suggest looking for relevant papers or existing work on datasets similar to yours; this will give you a bit of direction.
Upvotes: 0
Reputation: 720
A few suggestions from me:
Increase the number of neurons in the hidden layers.
Try not to use relu but tanh in your case.
Remove dropout layers until your model starts working, then you can add them back and retrain.
input_tensor = model_stack = Input(dataset.get_input_shape()) # (303, )
model_stack = Dense(128)(model_stack)
model_stack = Activation("tanh", name = "tanh_1")(model_stack)
model_stack = Dense(64)(model_stack)
model_stack = Activation("tanh", name = "tanh_2")(model_stack)
model_stack = Dense(1, name = "Output")(model_stack)
model = Model(inputs = [input_tensor], outputs = [model_stack])
model.compile(loss = "mse", optimizer = optimizer(lr = learning_rate), metrics = ['mse'])
model.summary()
Upvotes: 1