Baron Yugovich

Reputation: 4307

Sklearn fit vs predict, order of columns matters?

Say X1 and X2 are two pandas DataFrames with the same columns, but possibly in a different order. Assume model is some sort of sklearn model, like LassoCV. Say I do model.fit(X1, y), and then model.predict(X2). Is the fact that the columns are in a different order a problem, or does the model save weights by column name?

Also, same question, but what if X1 and X2 are numpy arrays?

Upvotes: 14

Views: 7533

Answers (1)

sacuL

Reputation: 51335

Yes, I believe it will matter: sklearn converts the pandas DataFrame to an array of values (essentially calling X1.values) and does not pay attention to the column names. However, it's an easy fix. Just use:

X2 = X2[X1.columns]

This re-orders X2's columns to match the order in X1.

The same is true of numpy arrays, of course. The model is fit on the columns in the order they appear in X1, so when you predict on X2, each column of X2 is treated as the feature that occupied the same position in X1.
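To make the numpy case concrete, here is a minimal sketch. It uses LinearRegression instead of LassoCV and synthetic data with made-up coefficients (both are my assumptions, chosen only so the effect is easy to see). Predicting on the same rows with the two columns swapped gives different results, and swapping them back restores the original predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption for illustration): y depends on column 0 and
# column 1 with different coefficients, so the column order matters.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(50, 2))
y = 2.0 * X1[:, 0] - 3.0 * X1[:, 1]

model = LinearRegression().fit(X1, y)

# Same rows, but with the two columns swapped.
X2 = X1[:, ::-1]

pred_same = model.predict(X1)              # columns in the fitted order
pred_swapped = model.predict(X2)           # columns in the wrong order
pred_fixed = model.predict(X2[:, ::-1])    # columns swapped back

print(np.allclose(pred_same, pred_swapped))  # the swapped input changes the predictions
print(np.allclose(pred_same, pred_fixed))    # restoring the order restores them
```

With plain arrays there are no names to fall back on, so keeping track of the column order is entirely up to you.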

Example:

Take these 2 dataframes:

>>> X1
   a  b
0  1  5
1  2  6
2  3  7

>>> X2
   b  a
0  5  3
1  4  2
2  6  1

The model is fit on X1.values:

>>> X1.values
array([[1, 5],
       [2, 6],
       [3, 7]])

And you predict on X2.values:

>>> X2.values
array([[5, 3],
       [4, 2],
       [6, 1]])

There is no way for the model to know that the columns are switched. So switch them manually:

X2 = X2[X1.columns]

>>> X2
   a  b
0  3  5
1  2  4
2  1  6
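If you want to guard against this more generally, something like the sketch below might help. Note that align_columns is a hypothetical helper name of my own, not part of pandas or sklearn. It reorders the new frame to the reference frame's column order and raises an error if the two frames do not have the same set of columns. As far as I know, newer scikit-learn releases (1.0 and later) also record the column names seen during fit in feature_names_in_ and check them at predict time rather than silently reordering, so aligning the columns explicitly is still the safe thing to do.

```python
import pandas as pd


def align_columns(X_new, X_ref):
    """Hypothetical helper: return X_new with its columns in X_ref's order.

    Raises ValueError if the two frames do not have the same set of columns.
    """
    missing = set(X_ref.columns) - set(X_new.columns)
    extra = set(X_new.columns) - set(X_ref.columns)
    if missing or extra:
        raise ValueError(
            f"column mismatch: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    return X_new[X_ref.columns]


# The two frames from the example above.
X1 = pd.DataFrame({"a": [1, 2, 3], "b": [5, 6, 7]})
X2 = pd.DataFrame({"b": [5, 4, 6], "a": [3, 2, 1]})

X2_aligned = align_columns(X2, X1)
print(X2_aligned)
```

You would then call model.predict(X2_aligned) instead of model.predict(X2).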

Upvotes: 26
