user826955

Reputation: 3206

Regression with deep learning - huge MSE and loss

I am trying to train a model to predict car prices. The dataset is from Kaggle: https://www.kaggle.com/vfsousas/autos#autos.csv

I am preparing the data with the following code:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

class CarDataset(DataSet):

    def __init__(self, csv_file):
        # drop columns that should not influence the price
        df = pd.read_csv(csv_file).drop(["dateCrawled", "name", "abtest", "dateCreated", "nrOfPictures", "postalCode", "lastSeen"], axis = 1)

        df = df.drop(df[df["seller"] == "gewerblich"].index).drop(["seller"], axis = 1)
        df = df.drop(df[df["offerType"] == "Gesuch"].index).drop(["offerType"], axis = 1)

        df = df[df["vehicleType"].notnull()]
        df = df[df["notRepairedDamage"].notnull()]
        df = df[df["model"].notnull()]
        df = df[df["fuelType"].notnull()]

        df = df[(df["price"] > 100) & (df["price"] < 100000)]
        df = df[(df["monthOfRegistration"] > 0) & (df["monthOfRegistration"] < 13)]
        df = df[(df["yearOfRegistration"] < 2019) & (df["yearOfRegistration"] > 1950)]
        df = df[(df["powerPS"] > 20) & (df["powerPS"] < 550)]

        df["hasDamage"] = np.where(df["notRepairedDamage"] == "ja", 1, 0)
        df["automatic"] = np.where(df["gearbox"] == "manuell", 1, 0)
        df["fuel"] = np.where(df["fuelType"] == "benzin", 0, 1)
        df["age"] = (2019 - df["yearOfRegistration"]) * 12 + df["monthOfRegistration"]

        df = df.drop(["notRepairedDamage", "gearbox", "fuelType", "yearOfRegistration", "monthOfRegistration"], axis = 1)

        df = pd.get_dummies(df, columns = ["vehicleType", "model", "brand"])

        self.df = df
        self.Y = self.df["price"].values
        self.X = self.df.drop(["price"], axis = 1).values

        scaler = StandardScaler()
        scaler.fit(self.X)

        self.X = scaler.transform(self.X)

        self.x_train, self.x_test, self.y_train, self.y_test = train_test_split(self.X, 
                                                                                    self.Y, 
                                                                                    test_size = 0.25,
                                                                                    random_state = 0)

        self.x_train, self.x_valid, self.y_train, self.y_valid = train_test_split(self.x_train, 
                                                                                    self.y_train, 
                                                                                    test_size = 0.25,
                                                                                    random_state = 0)   

    def get_input_shape(self):
        return (len(self.df.columns)-1, )        # (303, )

This results in the following prepared dataset:

    price  powerPS  kilometer  hasDamage  automatic  fuel  age  vehicleType_andere  vehicleType_bus  vehicleType_cabrio  vehicleType_coupe  ...  brand_rover  brand_saab  brand_seat  brand_skoda  brand_smart  brand_subaru  brand_suzuki  brand_toyota  brand_trabant  brand_volkswagen  brand_volvo
3    1500       75     150000          0          1     0  222                   0                0                   0                  0  ...            0           0           0            0            0             0             0             0              0                 1            0
4    3600       69      90000          0          1     1  139                   0                0                   0                  0  ...            0           0           0            1            0             0             0             0              0                 0            0
5     650      102     150000          1          1     0  298                   0                0                   0                  0  ...            0           0           0            0            0             0             0             0              0                 0            0
6    2200      109     150000          0          1     0  188                   0                0                   1                  0  ...            0           0           0            0            0             0             0             0              0                 0            0
10   2000      105     150000          0          1     0  192                   0                0                   0                  0  ...            0           0           0            0            0             0             0             0              0                 0            0

[5 rows x 304 columns]

  • hasDamage is a flag (0 or 1) indicating whether the car has unrepaired damage
  • automatic is a flag (0 or 1) for the gearbox type (as the code checks against "manuell", 1 actually means manual)
  • fuel is 0 for petrol ("benzin") and 1 for everything else (mostly diesel)
  • age is the age of the car in months

The columns brand, model and vehicleType are one-hot encoded using df = pd.get_dummies(df, columns = ["vehicleType", "model", "brand"]).

Also, I am going to use a StandardScaler to transform the X values.

The dataset now contains 303 feature columns for X, with Y being the "price" column.

With this dataset, a plain LinearRegression achieves a score of ~0.7 on both the training and the test set.
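For reference, the baseline is essentially this (a minimal sketch using the splits from the CarDataset class above; variable names are just examples):

from sklearn.linear_model import LinearRegression

dataset = CarDataset("autos.csv")

regressor = LinearRegression()
regressor.fit(dataset.x_train, dataset.y_train)

print("Train score: {}".format(regressor.score(dataset.x_train, dataset.y_train)))  # ~0.7
print("Test score: {}".format(regressor.score(dataset.x_test, dataset.y_test)))     # ~0.7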

Now I have tried a deep learning approach using Keras, but no matter what I do, the MSE and loss go through the roof, and the model does not seem to be able to learn anything:

from keras.layers import Input, Dense, Activation
from keras.models import Model
from keras.callbacks import ReduceLROnPlateau, EarlyStopping

input_tensor = model_stack = Input(dataset.get_input_shape()) # (303, )
model_stack = Dense(20)(model_stack)
model_stack = Activation("relu", name = "relu_1")(model_stack)

model_stack = Dense(20)(model_stack)
model_stack = Activation("relu", name = "relu_2")(model_stack)

model_stack = Dense(1, name = "Output")(model_stack)

model = Model(inputs = [input_tensor], outputs = [model_stack])
# optimizer is the optimizer class and learning_rate is 3e-4 (see the training log below)
model.compile(loss = "mse", optimizer = optimizer(lr = learning_rate), metrics = ['mse'])

model.summary()

callbacks = []
callbacks.append(ReduceLROnPlateau(monitor = "val_loss", factor = 0.95, verbose = 1, patience = 1))
callbacks.append(EarlyStopping(monitor = 'val_loss', patience = 5, min_delta = 0.01, restore_best_weights = True, verbose = 1))


model.fit(x = dataset.x_train,
          y = dataset.y_train,
          verbose = 1,
          batch_size = 128,
          epochs = 200,
          validation_data = (dataset.x_valid, dataset.y_valid),
          callbacks = callbacks)

score = model.evaluate(dataset.x_test, dataset.y_test, verbose = 1)
print("Model score: {}".format(score))

And the summary/training looks like this (the learning rate is 3e-4):

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 6)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 20)                140       
_________________________________________________________________
relu_1 (Activation)          (None, 20)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 20)                420       
_________________________________________________________________
relu_2 (Activation)          (None, 20)                0         
_________________________________________________________________
Output (Dense)               (None, 1)                 21        
=================================================================
Total params: 581
Trainable params: 581
Non-trainable params: 0
_________________________________________________________________
Train on 182557 samples, validate on 60853 samples
Epoch 1/200
182557/182557 [==============================] - 2s 13us/step - loss: 110046953.4602 - mean_squared_error: 110046953.4602 - acc: 0.0000e+00 - val_loss: 107416331.4062 - val_mean_squared_error: 107416331.4062 - val_acc: 0.0000e+00
Epoch 2/200
182557/182557 [==============================] - 2s 11us/step - loss: 97859920.3050 - mean_squared_error: 97859920.3050 - acc: 0.0000e+00 - val_loss: 85956634.8803 - val_mean_squared_error: 85956634.8803 - val_acc: 1.6433e-05
Epoch 3/200
182557/182557 [==============================] - 2s 12us/step - loss: 70531052.0493 - mean_squared_error: 70531052.0493 - acc: 2.1911e-05 - val_loss: 54933938.6787 - val_mean_squared_error: 54933938.6787 - val_acc: 3.2866e-05
Epoch 4/200
182557/182557 [==============================] - 2s 11us/step - loss: 42639802.3204 - mean_squared_error: 42639802.3204 - acc: 3.2866e-05 - val_loss: 32645940.6536 - val_mean_squared_error: 32645940.6536 - val_acc: 1.3146e-04
Epoch 5/200
182557/182557 [==============================] - 2s 11us/step - loss: 28282909.0699 - mean_squared_error: 28282909.0699 - acc: 1.4242e-04 - val_loss: 25315220.7446 - val_mean_squared_error: 25315220.7446 - val_acc: 9.8598e-05
Epoch 6/200
182557/182557 [==============================] - 2s 11us/step - loss: 24279169.5270 - mean_squared_error: 24279169.5270 - acc: 3.8344e-05 - val_loss: 23420569.2554 - val_mean_squared_error: 23420569.2554 - val_acc: 9.8598e-05
Epoch 7/200
182557/182557 [==============================] - 2s 11us/step - loss: 22874003.0459 - mean_squared_error: 22874003.0459 - acc: 9.8599e-05 - val_loss: 22380401.0622 - val_mean_squared_error: 22380401.0622 - val_acc: 1.6433e-05
...
Epoch 197/200
182557/182557 [==============================] - 2s 12us/step - loss: 13828827.1595 - mean_squared_error: 13828827.1595 - acc: 3.3414e-04 - val_loss: 14123447.1746 - val_mean_squared_error: 14123447.1746 - val_acc: 3.1223e-04

Epoch 00197: ReduceLROnPlateau reducing learning rate to 0.00020950120233464986.
Epoch 198/200
182557/182557 [==============================] - 2s 13us/step - loss: 13827193.5994 - mean_squared_error: 13827193.5994 - acc: 2.4102e-04 - val_loss: 14116898.8054 - val_mean_squared_error: 14116898.8054 - val_acc: 1.6433e-04

Epoch 00198: ReduceLROnPlateau reducing learning rate to 0.00019902614221791736.
Epoch 199/200
182557/182557 [==============================] - 2s 12us/step - loss: 13823582.4300 - mean_squared_error: 13823582.4300 - acc: 3.3962e-04 - val_loss: 14108715.5067 - val_mean_squared_error: 14108715.5067 - val_acc: 4.1083e-04
Epoch 200/200
182557/182557 [==============================] - 2s 11us/step - loss: 13820568.7721 - mean_squared_error: 13820568.7721 - acc: 3.1223e-04 - val_loss: 14106001.7681 - val_mean_squared_error: 14106001.7681 - val_acc: 2.3006e-04
60853/60853 [==============================] - 1s 18us/step
Model score: [14106001.790199332, 14106001.790199332, 0.00023006260989597883]

I am still a beginner in machine learning. Is there any big/obvious mistake in my approach? What am I doing wrong?

Upvotes: 3

Views: 739

Answers (3)

user826955

Reputation: 3206

Solution

So, after a while I found the Kaggle link to the correct dataset. I was using https://www.kaggle.com/vfsousas/autos at first, but the same data is also available at https://www.kaggle.com/orgesleka/used-cars-database, together with 222 kernels to look at. Looking at https://www.kaggle.com/themanchanda/neural-network-approach showed that this author also gets "big numbers" for the loss, which was the main source of my confusion (so far I had only dealt with "smaller numbers" or accuracies) and made me think again.

Then it got pretty clear to me:

  • The dataset was prepared correctly
  • The model was working correctly
  • I was using the wrong metrics / comparing against sklearn's LinearRegression score, which was not comparable anyway

In a nutshell:

  • An MAE (mean absolute error) of around 2000 means that, on average, the predicted car price is off by €2,000 (e.g. the correct price is €10,000 and the model predicts something between €8,000 and €12,000)
  • The MSE (mean squared error) is of course a much bigger number, which is to be expected and not "garbage" or a sign of a broken model, as I first interpreted it
  • The "accuracy" metric is meant for classification and is useless for regression
  • The default scoring function of sklearn's LinearRegression is the r2-score

So I changed the metrics to "mae" plus a custom r2-implementation, so that I could compare the model to the LinearRegression. On the first try, after around 100 epochs, I ended up with an MAE of 1900 and an r2-score of 0.69.

Then I also calculated the MAE for the LinearRegression for comparison, and it came out at 2855.417 (with an r2-score of 0.67).

So the deep learning approach was in fact already better, both in terms of MAE and r2-score. Nothing was wrong, and I can now go on and tune/optimize the model :)
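For reference, the custom r2-implementation looks roughly like this (a sketch; since Keras evaluates metrics per batch, this is only an approximation of the full r2-score, and variable names follow the snippets above):

from keras import backend as K
from sklearn.metrics import mean_absolute_error

def r2(y_true, y_pred):
    # r2 = 1 - SS_res / SS_tot, computed batch-wise
    ss_res = K.sum(K.square(y_true - y_pred))
    ss_tot = K.sum(K.square(y_true - K.mean(y_true)))
    return 1 - ss_res / (ss_tot + K.epsilon())

model.compile(loss = "mse", optimizer = optimizer(lr = learning_rate), metrics = ["mae", r2])

# MAE of the LinearRegression baseline for comparison (~2855)
print(mean_absolute_error(dataset.y_test, regressor.predict(dataset.x_test)))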

Upvotes: 1

Newbie

Reputation: 161

Your model seems to be underfitting.

Try adding more neurons, as suggested already.
Also try to increase the number of layers.
Try using sigmoid as your activation function.
Try increasing your learning rate. You can also switch between Adam and SGD (see the sketch below).

Fitting a model from scratch is always trial and error. Try changing one parameter at a time, then two together, and so on. Moreover, I would suggest looking for relevant papers or existing work on datasets similar to yours. This will give you a bit of direction.
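A rough sketch of these tweaks (layer sizes and learning rates are just examples, not tuned values; Input, Dense and Model as imported in the question):

from keras.optimizers import Adam, SGD

input_tensor = model_stack = Input(dataset.get_input_shape())
model_stack = Dense(64, activation = "sigmoid")(model_stack)   # more neurons per layer
model_stack = Dense(64, activation = "sigmoid")(model_stack)   # an additional layer
model_stack = Dense(1, name = "Output")(model_stack)

model = Model(inputs = [input_tensor], outputs = [model_stack])
model.compile(loss = "mse", optimizer = Adam(lr = 1e-3), metrics = ["mae"])   # or e.g. SGD(lr = 1e-2)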

Upvotes: 0

meowongac

Reputation: 720

A few suggestions from me:

  1. Increase the number of neurons in the hidden layers.

  2. Try tanh instead of relu in your case.

  3. Remove dropout layers until your model starts working, then you can add them back and retrain.

input_tensor = model_stack = Input(dataset.get_input_shape()) # (303, )
model_stack = Dense(128)(model_stack)
model_stack = Activation("tanh", name = "tanh_1")(model_stack)

model_stack = Dense(64)(model_stack)
model_stack = Activation("tanh", name = "tanh_2")(model_stack)

model_stack = Dense(1, name = "Output")(model_stack)

model = Model(inputs = [input_tensor], outputs = [model_stack])
model.compile(loss = "mse", optimizer = optimizer(lr = learning_rate), metrics = ['mse'])

model.summary()

Upvotes: 1
