Reputation: 1041
Do the names/order of the columns of my X_test dataframe have to be the same as the X_train I use for fitting?
Below is an example.
I am training my model with:
model.fit(X_train,y)
where X_train=data[['var1','var2']]
But then during prediction, when I use:
model.predict(X_test)
X_test is defined as: X_test=data[['var1','var3']], where var3 could be a completely different variable than var2.
Does predict assume that var3 is the same as var2 because it is the second column in X_test?
What if X_live was defined as: X_live=data[['var2','var1']]
Would predict know to re-order X to line them up correctly?
Upvotes: 6
Views: 5004
Reputation: 323
First, to answer your question "Does predict assume that var3 is the same as var2 because it is the second column in X_test?"
No. A machine learning model makes no assumption about the meaning of the data you pass to the fit function or the predict function. All the model sees is an array of numbers, possibly a multidimensional one, and it uses whatever value sits in each position. Keeping track of which feature is in which column is entirely up to the user.
Let's take a simple classification problem, where you have 2 groups: kids and adults.
Now you want to classify the individual below into one of the two classes.
| Age | Height | Weight |
|---|---|---|
| 10 | 120 | 34 |
Any well-trained classifier can easily assign this data point to the group of kids, since the age and weight are small. The vector the model will consider is [ 10, 120, 34 ].
But now let us reorder the feature columns in the following way: [ 120, 10, 34 ]. You know that the number 120 refers to the height of the individual and not the age! The model, however, has no way of knowing what you know or expect, and it is bound to classify the point into the group of adults.
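The example above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the training rows are made-up numbers, and `fit_stump` and `predict` are hypothetical helpers that implement a one-rule classifier (a decision stump), not any particular library's model.

```python
# Hypothetical training data: columns are [age, height_cm, weight_kg].
X_train = [
    [8, 125, 28], [10, 135, 32], [12, 150, 40],    # kids
    [25, 180, 75], [30, 170, 70], [45, 165, 80],   # adults
]
y_train = ["kid", "kid", "kid", "adult", "adult", "adult"]


def fit_stump(X, y):
    """Find the (column index, threshold) that best separates the labels.

    Rows with value <= threshold are predicted "kid", others "adult".
    """
    best = None  # (accuracy, column, threshold)
    for col in range(len(X[0])):
        values = sorted({row[col] for row in X})
        # Candidate thresholds are midpoints between neighbouring values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            correct = sum(
                ("kid" if row[col] <= threshold else "adult") == label
                for row, label in zip(X, y)
            )
            accuracy = correct / len(X)
            # Strict ">" keeps the earliest column on ties.
            if best is None or accuracy > best[0]:
                best = (accuracy, col, threshold)
    return best[1], best[2]


def predict(stump, row):
    col, threshold = stump
    # The stump only knows a position, not a feature name.
    return "kid" if row[col] <= threshold else "adult"


stump = fit_stump(X_train, y_train)
correct_order = predict(stump, [10, 120, 34])   # age, height, weight
wrong_order = predict(stump, [120, 10, 34])     # height, age, weight
print(stump, correct_order, wrong_order)
```

The learned rule is tied to column 0, so once the height lands in that position it is treated as an age.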
Hope that answers both your questions.
Upvotes: 2
Reputation: 13498
The names of your columns don't matter, but the order does. You need to make sure that the order is consistent between your training and test data. If you pass in two columns in your training data, your model will assume that any future inputs are those features in that order.
Just a really simple thought experiment. Imagine you train a model that subtracts two numbers. The features are (n_1, n_2), and your output is going to be n_1 - n_2.
Your model doesn't process the names of your columns (since only numbers are passed in), and so it learns the relationship between the first column, the second column, and the output, namely output = col_1 - col_2.
Regardless of what you pass in, you'll get the result of the first thing you passed in minus the second thing you pass in. You can name the first thing you pass in and the second thing you pass in to whatever you want, but at the end of the day you'll still get the result of the subtraction.
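Here is a minimal sketch of that thought experiment in plain Python. The training pairs are made-up numbers, and `fit_two_feature_linear` is a hypothetical helper that does an ordinary least-squares fit for two features; a real library would do the fitting for you.

```python
# Hypothetical training data: each row is (n_1, n_2), the label is n_1 - n_2.
X_train = [(5, 3), (2, 7), (10, 4), (1, 1)]
y_train = [n1 - n2 for n1, n2 in X_train]


def fit_two_feature_linear(X, y):
    """Least-squares fit of y = w1*col_1 + w2*col_2 (no intercept).

    Solves the 2x2 normal equations with Cramer's rule.
    """
    a = sum(x1 * x1 for x1, _ in X)
    b = sum(x1 * x2 for x1, x2 in X)
    d = sum(x2 * x2 for _, x2 in X)
    p = sum(x1 * t for (x1, _), t in zip(X, y))
    q = sum(x2 * t for (_, x2), t in zip(X, y))
    det = a * d - b * b
    return (p * d - b * q) / det, (a * q - b * p) / det


def predict(weights, row):
    w1, w2 = weights
    # Weights are applied by position; no names are involved.
    return w1 * row[0] + w2 * row[1]


weights = fit_two_feature_linear(X_train, y_train)
in_order = predict(weights, (9, 4))   # columns in the training order
swapped = predict(weights, (4, 9))    # same values, columns swapped
print(weights, in_order, swapped)
```

The fit recovers a weight of 1 for the first column and -1 for the second, so swapping the columns flips the sign of the prediction.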
To get a little more technical, what's going on inside many models (linear models and neural networks, for example) is mostly a series of matrix multiplications. You pass in the input matrix, the multiplications happen, and you get what comes out. Training the model just "tunes" the values in the matrices that your inputs get multiplied by, with the intention of making the output of these multiplications as close as possible to your label. If you pass in an input matrix that isn't like the ones it was trained on, the multiplications still happen, but you'll almost certainly get a terribly wrong output. There's no intelligent feature rearranging going on underneath.
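Since nothing is rearranged for you, one way to keep the order consistent is to remember the training column list and select by it before predicting. The sketch below assumes `data` is a pandas DataFrame as in the question; the values and the names `feature_order`, `X_live_aligned` and `missing` are made up for illustration.

```python
import pandas as pd

# Hypothetical data; the column names follow the question.
data = pd.DataFrame({
    "var1": [1.0, 2.0, 3.0],
    "var2": [10.0, 20.0, 30.0],
    "var3": [7.0, 8.0, 9.0],
})

X_train = data[["var1", "var2"]]
feature_order = list(X_train.columns)   # remember the order used for fitting

# Same features, but in a different order.
X_live = data[["var2", "var1"]]

# Selecting by the saved list restores the training order.
X_live_aligned = X_live[feature_order]

# A frame that lacks a training feature cannot be aligned by name.
X_test = data[["var1", "var3"]]
missing = [c for c in feature_order if c not in X_test.columns]
print(list(X_live_aligned.columns), missing)
```

Checking `missing` before calling predict catches the var3-for-var2 case from the question, where reordering alone would not help.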
Upvotes: 6