wildcat89

Reputation: 1285

How to load an already trained XGBoost model to run on a new dataset?

I'm new to XGBoost, so forgive me. I've trained a model on the Boston housing dataset and saved it locally. Now I want to load the model and use it to predict labels for a new dataset with a similar structure. How would I go about doing this in Python 3.6? This is what I have from the training step so far:

UPDATED TO TRY PICKLE INSTEAD

UPDATE 2: Added cause of error, preprocessing.

UPDATE 3: See below comments for answer

    print('Splitting the features and label columns...')
    X, y = data.iloc[:,:-1],data.iloc[:,-1]

    print('Converting dataset to Dmatrix structure to use later on...')
    data_dmatrix = xgb.DMatrix(data=X,label=y)
    #....
    # Some more stuff here.
    #....
    print('Now, train the model...')
    grid = xgb.train(params=params, dtrain=data_dmatrix, num_boost_round=10)

    # Now, save the model for later use on unseen data
    import pickle
    # pickle.dump returns None, so there is nothing useful to assign
    pickle.dump(grid, open("pima.pickle.dat", "wb"))

    #.....after some time has passed

    # Now, load the model for use on a new dataset
    loaded_model = pickle.load(open("pima.pickle.dat", "rb"))
    print(loaded_model.feature_names)

    # Now, load a new dataset to run the model on and make predictions for
    dataset = pd.read_csv('Boston Housing Data.csv', skiprows=1)

    # Split the dataset into features and label
    # X = use all rows, up until the last column, which is the label or predicted column
    # y = use all rows in the last column of the dataframe ('Price')
    print('Splitting the new features and label column up for predictions...')
    X, y = dataset.iloc[:,:-1],dataset.iloc[:,-1]


    # Make predictions on labels of the test set
    preds = loaded_model.predict(X)

Now I get the traceback:

        preds = loaded_model.predict(X)
    AttributeError: 'DataFrame' object has no attribute 'feature_names'

Any ideas? I'm noticing that when I print the loaded_model.feature_names I get:

    ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

...but the actual .csv file has an extra column 'PRICE' which was appended before the training and used as the label during training. Does this mean anything?

I didn't think I'd have to go through the whole train/test split because I'm not looking to re-train the model. I just want to use it on this new dataset to make predictions and report the RMSE against the actual values. None of the tutorials I've found online cover the step of applying a saved model to new data. Thoughts? Thanks!

Upvotes: 1

Views: 6616

Answers (1)

LazyCoder

Reputation: 1265

You need to apply the same preprocessing to the test set that you applied to the training set before you can make predictions. The problem is that you converted the training data to a DMatrix (which xgb.train requires, by the way):

    print('Converting dataset to Dmatrix structure to use later on...')
    data_dmatrix = xgb.DMatrix(data=X,label=y)

but did not do the same for the new data, so predict received a plain DataFrame. Use the same preprocessing for the training, validation, and test sets, and your model will be golden.

Upvotes: 3
