Tartaglia

Reputation: 1041

Do the names and order of features matter for a prediction algorithm?

Do the names/order of the columns of my X_test dataframe have to be the same as the X_train I use for fitting?

Below is an example

I am training my model with:

model.fit(X_train,y)

where X_train = data[['var1', 'var2']]

But then during prediction, when I use:

model.predict(X_test)

X_test is defined as: X_test = data[['var1', 'var3']]

where var3 could be a completely different variable than var2.

Does predict assume that var3 is the same as var2 because it is the second column in X_test?

What if:

X_live was defined as: X_live = data[['var2', 'var1']]

Would predict know to re-order the columns of X_live to line them up correctly?

Upvotes: 6

Views: 5004

Answers (2)

Aruparna Maity

Reputation: 323

First, to answer your question: "Does predict assume that var3 is the same as var2 because it is the second column in X_test?"

No. A machine learning model makes no assumptions about what the data passed to fit or predict means. All it sees is an array of numbers, possibly a multidimensional one. In practice, the values in the second column are used wherever the second training column was used, whatever they represent. Keeping track of which feature sits in which position is entirely the user's responsibility.

Let's take a simple classification problem, where you have 2 groups:

  • First one is a group of kids, with short height, and thereby lesser weight,
  • Second group is of mature adults, with higher age, height and weight.

Now you want to classify the below individual into any one of the classes.

Age Height Weight
10 120 34

Any well-trained classifier can easily assign this data point to the group of kids, since the age and weight are small. The vector the model sees is [ 10, 120, 34 ]. Now reorder the feature columns to [ 120, 10, 34 ]. You know that 120 refers to the height of the individual and not the age, but the model has no way of knowing what you intend, so it reads 120 as the age and is likely to assign the point to the group of adults.
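To make this concrete, here is a minimal sketch in plain Python. The training rows and the tiny one-split classifier (a "decision stump") are made up for illustration; they are not from any library. With this data the stump happens to split on the first column (age), so the outcome for the swapped row depends on that choice. A different model could react differently, but none would know the columns were swapped.

```python
# Hypothetical training data: each row is [age, height_cm, weight_kg].
X_train = [[8, 115, 28], [10, 125, 33], [12, 140, 40],
           [30, 170, 70], [45, 165, 68], [60, 175, 80]]
y_train = ["kid", "kid", "kid", "adult", "adult", "adult"]


def fit_stump(X, y):
    """Pick the (column index, threshold) pair with the fewest training errors.

    The rule is: value <= threshold -> "kid", otherwise "adult".
    Ties are broken in favour of the lowest column index.
    """
    best = None
    for col in range(len(X[0])):
        values = sorted({row[col] for row in X})
        # Candidate thresholds are midpoints between consecutive values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            errors = sum(
                ("kid" if row[col] <= threshold else "adult") != label
                for row, label in zip(X, y)
            )
            if best is None or errors < best[0]:
                best = (errors, col, threshold)
    return best[1], best[2]


def predict(stump, row):
    """Classify a row by position only; the stump never sees column names."""
    col, threshold = stump
    return "kid" if row[col] <= threshold else "adult"


stump = fit_stump(X_train, y_train)    # (0, 21.0): column 0, threshold 21

print(predict(stump, [10, 120, 34]))   # age, height, weight -> kid
print(predict(stump, [120, 10, 34]))   # height in the age slot -> adult
```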

Hope that answers both your questions.

Upvotes: 2

Primusa

Reputation: 13498

The names of your columns don't matter but the order does. You need to make sure that the order is consistent between your training and test data. If you pass in two columns in your training data, your model will assume that any future inputs are those features in that order.

Here is a really simple thought experiment. Imagine you train a model that subtracts two numbers. The features are (n_1, n_2), and your output is going to be n_1 - n_2.

Your model doesn't process the names of your columns (since only numbers are passed in), and so it learns the relationship between the first column, the second column, and the output - namely output = col_1 - col_2.

Regardless of what you pass in, you'll get the first thing you pass in minus the second thing you pass in. You can name the two inputs whatever you want, but at the end of the day you'll still get the result of the subtraction.
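The thought experiment can be sketched with NumPy. This is an illustration under assumptions, not any particular library's fitting routine: the training data is randomly generated, and the "model" is a plain least-squares fit without an intercept.

```python
import numpy as np

# Hypothetical training set: 50 rows of (n_1, n_2), target is n_1 - n_2.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 100, size=(50, 2)).astype(float)
y_train = X_train[:, 0] - X_train[:, 1]

# Least-squares fit of y = X @ w; the weights come out close to [1, -1].
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def predict(X):
    """Apply the learned weights by column position; no names are involved."""
    return np.asarray(X, dtype=float) @ w


print(w)
print(predict([[5, 3]]))   # first column minus second column
print(predict([[3, 5]]))   # same two values swapped: the sign flips
```

Swapping the two columns changes the answer even though the same two values were passed in, because the weights are tied to positions.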

To get a little more technical, what's going on inside many models (linear models and neural networks, for example) is mostly a series of matrix multiplications. You pass in the input matrix, the multiplications happen, and you get what comes out. Training the model just "tunes" the values in the matrices that your inputs get multiplied by, with the intention of making the output of these multiplications as close as possible to your label. If you pass in an input matrix that isn't like the ones it was trained on, the multiplications still happen, but you'll almost certainly get a terribly wrong output. There's no intelligent feature rearranging going on underneath.
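Since nothing rearranges the features for you, one option is to record the training column order and select the columns by name before predicting. The sketch below assumes pandas and NumPy. The data, the weight vector w (a stand-in for a fitted model) and the variable train_columns are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data using the question's column names.
data = pd.DataFrame({"var1": [1.0, 2.0, 3.0], "var2": [10.0, 20.0, 30.0]})

X_train = data[["var1", "var2"]]
train_columns = list(X_train.columns)  # remember the order used for fitting

# Stand-in for a fitted model: hypothetical weights for (var1, var2).
w = np.array([1.0, -1.0])

X_live = data[["var2", "var1"]]  # same features, different column order

# Multiplying as-is pairs var2 with the weight that belongs to var1.
misaligned = X_live.to_numpy() @ w

# Selecting by name restores the training order; a missing column
# (for example var3 in place of var2) would raise a KeyError here.
X_live_aligned = X_live[train_columns]
aligned = X_live_aligned.to_numpy() @ w

print(misaligned)
print(aligned)
```

Selecting by name fixes ordering problems and fails loudly when a training column is absent, but it cannot tell you whether a column with the right name actually holds the right quantity.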

Upvotes: 6
