Do the names and order of features matter for a prediction algorithm?

Question

Do the names/order of the columns of my X_test dataframe have to be the same as the X_train I use for fitting?

Below is an example

I am training my model with:

model.fit(X_train,y)

where X_train = data[['var1', 'var2']]

But then during prediction, when I use:

model.predict(X_test)

X_test is defined as: X_test = data[['var1', 'var3']]

where var3 could be a completely different variable than var2.

Does predict assume that var3 is the same as var2 because it is the second column in X_test?

What if:

X_live was defined as: X_live = data[['var2', 'var1']]

Would predict know to re-order the columns of X_live to line them up correctly?

Primusa · Accepted Answer

The names of your columns don't matter to the model, but the order does. You need to keep the column order the same in your training and test data. If you train on two columns, your model will assume that every future input contains those same features in that same order.

Here is a simple thought experiment. Imagine you train a model that subtracts two numbers. The features are (n_1, n_2), and your output is going to be n_1 - n_2.

Your model doesn't process the names of your columns (since only numbers are passed in), and so it learns the relationship between the first column, the second column, and the output - namely output = col_1 - col_2.

Regardless of what you pass in, you'll get the first value minus the second value. You can name the two inputs whatever you want, but at the end of the day you'll still get the result of that subtraction.
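The thought experiment can be sketched in a few lines. This is only an illustration, assuming a plain least-squares linear model fitted with NumPy rather than any particular library's estimator; the `predict` helper and the toy data are made up for the example.

```python
import numpy as np

# Toy training data: two features, target = first column minus second column.
rng = np.random.default_rng(0)
X_train = rng.uniform(-10, 10, size=(200, 2))
y_train = X_train[:, 0] - X_train[:, 1]

# Fit a linear model (no intercept) by ordinary least squares.
weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def predict(X):
    """Apply the learned weights by position; column names never enter into it."""
    return np.asarray(X, dtype=float) @ weights

in_order = predict([[5.0, 3.0]])[0]  # approximately 5 - 3
swapped = predict([[3.0, 5.0]])[0]   # approximately 3 - 5, the sign flips
```

The learned weights come out close to (1, -1), so swapping the two inputs flips the sign of the prediction. Nothing in the fitted model knows which value was "meant" to be n_1.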

To get a little more technical, what's going on inside many models (linear models and neural networks, for example) is mostly a series of matrix multiplications. You pass in the input matrix, the multiplications happen, and you get whatever comes out. Training the model just "tunes" the values in the matrices that your inputs get multiplied by, with the goal of making the output of these multiplications as close as possible to your labels. If you pass in an input matrix whose columns don't match the ones it was trained on, the multiplications still happen, but you'll almost certainly get a badly wrong output. There's no intelligent feature rearranging going on underneath.
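Since the model will not rearrange anything for you, one option is to select the columns by name, in training order, before predicting. A minimal sketch, assuming pandas DataFrames; `TRAIN_COLUMNS`, `align_features`, and the fixed `weights` standing in for a trained model are all hypothetical names for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical: the feature order that was used when fitting.
TRAIN_COLUMNS = ["var1", "var2"]

def align_features(frame, train_columns):
    """Return the training columns in training order; raise if any are missing."""
    missing = [c for c in train_columns if c not in frame.columns]
    if missing:
        raise ValueError(f"missing features: {missing}")
    return frame[train_columns]

# Stand-in for a trained model: fixed weights for output = var1 - var2.
weights = np.array([1.0, -1.0])

# Live data with the columns in the wrong order.
X_live = pd.DataFrame({"var2": [3.0], "var1": [5.0]})

# Positional multiplication on the raw frame computes 3 - 5.
raw = X_live.to_numpy() @ weights

# After aligning by name, the same multiplication computes 5 - 3.
aligned = align_features(X_live, TRAIN_COLUMNS).to_numpy() @ weights
```

A frame containing var3 instead of var2 would raise here rather than silently being treated as var2, which is usually the safer failure mode.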
