semidevil

Reputation: 69

xgboost - feature mismatch when I predict on my test data

I'm using xgboost to train a model and then I want to score it on a test set. My data is a combination of categorical and numeric variables, so I used pd.get_dummies to dummy all my categorical variables. Training is fine, but the problem happens when I score the model on the test set.

I get an error of "feature_names_mismatch" and it lists the columns that are missing. My dataset is already in a matrix (numpy array format).

The mismatch in feature names is expected, since some dummy categories may not be present in the test set. When this happens, is there a way for the model to still work?

Upvotes: 1

Views: 750

Answers (1)

SAL

Reputation: 632

If I understood your problem correctly, you have some categorical values which appear in the train set but not in the test set. This usually happens when you create dummy variables (converting categorical features with one-hot encoding etc.) separately for train and test instead of doing it on the entire dataset. The following code can help:

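  import pandas as pd
  from sklearn.model_selection import train_test_split

  # Assumes X holds the categorical columns of df and y is the target
  for col in featurs_object:
      # Take the category list from the full dataset so every level gets a dummy column
      X[col] = pd.Categorical(X[col], categories=df[col].dropna().unique())
      X_col = pd.get_dummies(X[col])
      X = X.drop(col, axis=1)
      X_col.columns = X_col.columns.tolist()  # convert the column labels to a plain list
      X = pd.concat([X_col, X], axis=1)
  # Add the continuous features back, then split
  X = pd.concat([X, df_continous], axis=1)
  X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                      test_size=0.3,
                                                      random_state=1)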
  • featurs_object : list of all categorical columns which you want to include for model building.
  • df : your entire dataset (post cleanup).
  • df_continous : subset of df, with only the continuous features.
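
If the train and test sets have already been dummied separately, another option is to align the test columns to the training columns before predicting. Below is a minimal sketch of that idea, not a definitive fix: the toy frames train_raw and test_raw and their column names are made up for illustration, and it relies on pandas' DataFrame.reindex to add the missing dummy columns filled with 0.

```python
import pandas as pd

# Hypothetical toy data: "green" appears in the training rows but not in the test rows.
train_raw = pd.DataFrame({"color": ["red", "green", "blue"], "size": [1.0, 2.0, 3.0]})
test_raw = pd.DataFrame({"color": ["red", "blue"], "size": [4.0, 5.0]})

# Dummying each set separately produces different column sets.
X_train = pd.get_dummies(train_raw, columns=["color"])
X_test = pd.get_dummies(test_raw, columns=["color"])

# Dummy columns the model was trained on but the test set lacks.
missing = sorted(set(X_train.columns) - set(X_test.columns))

# Align test to train: missing columns are added as 0, columns unseen in
# training are dropped, and the column order follows the training set.
X_test_aligned = X_test.reindex(columns=X_train.columns, fill_value=0)
```

Since your data is already a numpy array, this alignment would have to be done while it is still a DataFrame, and only afterwards converted to an array for xgboost.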

Upvotes: 2
