ekta

Reputation: 1620

Implementing Gradient Boosted Regression Trees in production - mathematically describing the learned model

I have been using logistic regression (LR with L1 regularization, with start_params taken from the params fitted on the previous (train) data-set) to model our use case (with some sophisticated feature transformations). I tried a Gradient Boosting Classifier on part of the same data, and it appears to give a better fit. Traditionally, I have used the Gradient Boosting Classifier's feature importances as feedback for my feature engineering for LR.

The classic roadblock I see in moving fully to Gradient Boosting (GB) is that I don't quite understand how to express the "learnt tree" as a mathematical construct. So far I have mostly used these classification and regression examples from the SKLearn documentation to play around & compare the predictions.

Question: I understand that Gradient Boosting is a non-parametric model. Does this mean I can never get the mathematical construct back? Sorry if this sounds very primitive, but I have had no experience pushing these into production. That is, unless I really learn & predict the class in real time, how would I "classify" labels into one class or the other? How could one use the model in production?

# Fit regression model (X_train, y_train are the training features and targets)
from sklearn import ensemble

params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 1,
          'learning_rate': 0.01, 'loss': 'ls'}
clf = ensemble.GradientBoostingRegressor(**params)

pred_object = clf.fit(X_train, y_train)  # fit() returns the fitted estimator itself
pred_object
GradientBoostingRegressor(alpha=0.9, init=None, learning_rate=0.01, loss='ls',
             max_depth=4, max_features=None, min_samples_leaf=1,
             min_samples_split=1, n_estimators=500, random_state=None,
             subsample=1.0, verbose=0)
# Next, I get the feature importances
pred_object.feature_importances_
array([  3.08111834e-02,   1.44739767e-04,   1.31885157e-02,
         2.68202997e-05,   3.01134511e-02,   2.82639689e-01,
         7.67647932e-02,   5.90503853e-02,   7.86688625e-03,
         2.48124873e-02,   8.52094429e-02,   3.93616279e-02,
         3.50009978e-01])

I dug into dir(pred_object), but couldn't find anything that I could immediately comprehend. Is it possible to recover the specific mathematical construct from this, given the feature importance array, the loss function ='ls', alpha & the other parameters? Or, because it's a tree, will it always try to "re-balance" given more data points (the test set) when trying to predict the class for new data points?

Upvotes: 3

Views: 3084

Answers (3)

KT.

Reputation: 11440

The SKompiler library might help by translating your trained model to an expression in SQL (or a variety of other languages).

from skompiler import skompile
import sys
sys.setrecursionlimit(10000)  # Necessary for models with many estimators
expr = skompile(clf.predict)  # clf is the fitted GradientBoostingRegressor from the question

You can now obtain the SQL form of the model:

sql_code = expr.to('sqlalchemy/sqlite')

or translate it to C code (which is, at least in the case of the GradientBoostingRegressor, compatible with other C-like languages, such as C++/C#/Java/JavaScript):

c_code = expr.to('sympy/c')

or to Rust:

rust_code = expr.to('sympy/rust')

or to R:

r_code = expr.to('sympy/r')

or even to an Excel formula:

excel_code = expr.to('excel')

(with n_estimators=500, though, you won't be able to paste this into an Excel cell, because the formula will be longer than the 8192-character maximum that Excel allows)

Finally, it is not too complicated to write a compiler from a SKompiler expr object to the kind of representation or code you might need for your deployment.

Upvotes: 0

opt135

Reputation: 141

Gradient boosting, like classical decision trees and random forests, belongs to the family of so-called tree models or tree methods. This means its scoring logic is typically just "if ... then ..., else if ..., else ...". I'm not sure whether that fits the bill of a "mathematical construct"; I suppose not.
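To see that if/else structure directly, here is a minimal sketch (assuming clf is the fitted GradientBoostingRegressor from the question and that your scikit-learn version provides sklearn.tree.export_text):

from sklearn.tree import export_text

# each boosting stage of clf.estimators_ holds one DecisionTreeRegressor
first_tree = clf.estimators_[0, 0]
print(export_text(first_tree))   # prints the nested if/else splits of that tree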

Depending on why you need a mathematical construct, I can follow up and expand later. I suspect that computing observation-wise (row-wise) contributions from the built GB model may be the motivation behind the question.
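If per-stage contributions to a single prediction are what you're after, here is a small sketch (again assuming clf is the fitted GradientBoostingRegressor from the question); note these are per-boosting-stage, not per-feature, contributions:

import numpy as np

row = X_train[:1]                                     # a single observation
# prediction for that row after each boosting stage
staged = np.array([p[0] for p in clf.staged_predict(row)])
per_stage = np.diff(staged)                           # how much each later stage added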

Upvotes: 0

Zach

Reputation: 30331

There are 2 ways to push a GBM "to production".

  1. Pull the data into python, R, or whatever language you used to fit the model. Score it, and write it back to the database (or whatever your production datastore is). This can actually scale pretty well: if you can drop your "events" that need scoring into a queue, you can have 20, 100, or 1000 machines running duplicate copies of your python model, scoring each "event" independently.
  2. Encode your model as a SQL statement, and run it on a database of your choosing. (If you're using a nosql database or some other data store, hopefully you have some way of running if-then-else statements).

1 is pretty self-explanatory. Break your production data up into manageable chunks, and score each chunk on a different machine running your model. This requires some work to build infrastructure, but you don't need to change any of your modeling code.

2 is a little harder to grok: Tree-based models are, at their core, a collection of if-else statements:

if var1>10 and var2<3 then outcome = 0
else if var1<10 and var2<3 then outcome = 1
else if var2<10 and var2<1 then outcome = 0

etc.

Such statements are easy to encode in SQL-based databases, and are also easy to encode in most programming languages. If you can loop through each tree in your GBM in python and turn it into a SQL statement, you can score your model in production by running each SQL statement and then multiplying its output by the correct weight from your GBM. This requires you to transcode your model into another language, but it lets you score your data without pulling it away from your data store.
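As a rough sketch of that loop (assuming clf is the fitted GradientBoostingRegressor from the question and feature_names is a list of your column names as they appear in the database; for the 'ls' loss the final score is the initial prediction, i.e. the training-target mean, plus learning_rate times the sum of the per-tree outputs):

def tree_to_sql(tree, feature_names):
    """Recursively turn one regression tree into a nested SQL CASE expression."""
    t = tree.tree_
    def walk(node):
        if t.children_left[node] == -1:            # leaf: emit its raw value
            return str(t.value[node][0][0])
        name = feature_names[t.feature[node]]
        left = walk(t.children_left[node])
        right = walk(t.children_right[node])
        return "CASE WHEN {} <= {} THEN {} ELSE {} END".format(
            name, t.threshold[node], left, right)
    return walk(0)

# one CASE expression per boosting stage; combine them as
# init_prediction + learning_rate * (sum of all stage expressions)
case_statements = [tree_to_sql(stage[0], feature_names) for stage in clf.estimators_]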

Upvotes: 1
