Reputation: 1620
I have been using logistic regression (LR, with start_params set to the params fitted on the previous (train) data-set, and L1 regularization) to model our use case, with some sophisticated feature transformations. I tried a Gradient Boosting Classifier on part of the same data, and it appears to give a better fit. Traditionally, I have used the Gradient Boosting Classifier's feature importances as feedback for my feature engineering for LR.
The classic roadblock I see in going all in on Gradient Boosting (GB) is that I don't quite understand how to formulate the "learnt trees" into a mathematical construct. So far I have mostly used the classification and regression examples from the SKLearn documentation to play around and compare predictions.
Question: I understand that Gradient Boosting is a non-parametric model. Does this mean I can never get the mathematical construct back? Sorry if this sounds very primitive, but I have no experience pushing these models into production. That is, unless I learn and predict the class in real time, how would I "classify" labels into one class or the other? How could one use the model in production?
# Fit regression model
from sklearn import ensemble

params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 2,
          'learning_rate': 0.01, 'loss': 'ls'}
clf = ensemble.GradientBoostingRegressor(**params)
pred_object = clf.fit(X_train, y_train)
pred_object
GradientBoostingRegressor(alpha=0.9, init=None, learning_rate=0.01, loss='ls',
max_depth=4, max_features=None, min_samples_leaf=1,
min_samples_split=2, n_estimators=500, random_state=None,
subsample=1.0, verbose=0)
# Next, I get the feature importances,
pred_object.feature_importances_
array([ 3.08111834e-02, 1.44739767e-04, 1.31885157e-02,
2.68202997e-05, 3.01134511e-02, 2.82639689e-01,
7.67647932e-02, 5.90503853e-02, 7.86688625e-03,
2.48124873e-02, 8.52094429e-02, 3.93616279e-02,
3.50009978e-01])
I dug into dir(pred_object), but couldn't find anything that I could immediately comprehend. Is it possible to recover the specific mathematical construct from this, given the feature importance array, the loss function ('ls'), alpha, and the other parameters? Or, because it's a tree ensemble, will it always try to "re-balance" itself given more data points (the test set) when predicting the class for new data points?
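For what it's worth, pred_object does expose the individual fitted trees (this is just me poking around; the attributes below are what I see on my scikit-learn version):

pred_object.estimators_.shape
# (500, 1) -- apparently one DecisionTreeRegressor per boosting stage
pred_object.estimators_[0, 0]
# DecisionTreeRegressor(...)

but I still don't see how to combine these into a single formula.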
Upvotes: 3
Views: 3084
Reputation: 11440
The SKompiler library might help by translating your trained model to an expression in SQL (or a variety of other languages).
from skompiler import skompile
import sys
sys.setrecursionlimit(10000) # Necessary for models with many estimators
expr = skompile(gbr.predict)  # gbr is your fitted GradientBoostingRegressor (pred_object above)
You can now obtain the SQL form of the model:
sql_code = expr.to('sqlalchemy/sqlite')
or translate it to C code (which, at least in the case of the GradientBoostingRegressor, is also valid in other C-like languages such as C++/C#/Java/JavaScript):
c_code = expr.to('sympy/c')
or to Rust:
rust_code = expr.to('sympy/rust')
or to R:
r_code = expr.to('sympy/r')
or even to an Excel formula:
excel_code = expr.to('excel')
(With n_estimators=500, though, you won't be able to paste this into an Excel cell, because the formula will be longer than the 8,192-character maximum that Excel allows.)
Finally, it is not too complicated to write a compiler from a SKompiler expr object to the kind of representation or code you might need for your deployment.
Upvotes: 0
Reputation: 141
Gradient boosting, like classical decision trees and random forests, belongs to the so-called tree methods. That means its scoring logic is typically just a chain of "if ..., then ..., else if ..., else ..." rules. I'm not sure whether that fits the bill of a 'mathematical construct'; I suppose not.
Depending on why you need a mathematical construct, I can follow up and expand later. I suspect the motivation behind the question may be computing the per-observation (row-wise) contributions of the fitted GB model.
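If you just want to see those if/else rules, one option (a minimal sketch, assuming a fitted scikit-learn gradient boosting model named clf as in the question, and a scikit-learn version that provides sklearn.tree.export_text) is to dump an individual tree as text:

from sklearn.tree import export_text

# Each boosting stage of a fitted gradient boosting model is a plain
# DecisionTreeRegressor, stored in clf.estimators_ (an array of shape
# (n_estimators, 1) in the regression case).
first_tree = clf.estimators_[0, 0]

# Print that tree as nested if/else rules (it uses feature_0, feature_1, ...
# unless you pass feature_names=<your column names>).
print(export_text(first_tree))

The full model's score is then just the sum of such trees, scaled by the learning rate, plus an initial constant.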
Upvotes: 0
Reputation: 30331
There are two ways to push a GBM "to production": (1) score your data in batches with the Python model itself, or (2) transcode the model into another language, such as SQL.
Option 1 is pretty self-explanatory: break your production data up into manageable chunks, and score each chunk on a different machine running your model. This requires some work to build infrastructure, but you don't need to change any of your modeling code.
Option 2 is a little harder to grok. Tree-based models are, at their core, a collection of if-else statements:
if var1>10 and var2<3 then outcome = 0
else if var1<10 and var2<3 then outcome = 1
else if var1<10 and var2<1 then outcome = 0
etc.
Such statements are easy to encode in SQL-based databases, and are also easy to encode in most programming languages. If you can loop through each tree in your GBM in Python and turn it into a SQL statement, you can score your model in production by running each SQL statement and multiplying its output by the correct weight from your GBM. This requires you to transcode your model into another language, but it lets you score your data without pulling it out of your data store.
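As a rough sketch of that loop (assuming a fitted scikit-learn GradientBoostingRegressor named clf, trained with the default init and the 'ls'/squared-error loss, and a table whose columns are named var0, var1, ...; those names are just placeholders for your real columns):

def tree_to_sql(tree, feature_names, node=0):
    # In sklearn's tree structure, a leaf is marked by children_left == -1.
    if tree.children_left[node] == -1:
        return repr(float(tree.value[node][0][0]))
    name = feature_names[tree.feature[node]]
    left = tree_to_sql(tree, feature_names, tree.children_left[node])
    right = tree_to_sql(tree, feature_names, tree.children_right[node])
    return "CASE WHEN {} <= {} THEN {} ELSE {} END".format(
        name, tree.threshold[node], left, right)

feature_names = ["var{}".format(i) for i in range(X_train.shape[1])]
tree_exprs = [tree_to_sql(est.tree_, feature_names)
              for est in clf.estimators_[:, 0]]

# Final score = initial (mean) prediction + learning_rate * sum of tree outputs
base = float(clf.init_.predict(X_train[:1]).ravel()[0])
sql_score = "{} + {} * ({})".format(base, clf.learning_rate,
                                    " + ".join(tree_exprs))

The resulting expression can then be dropped into a SELECT statement and evaluated entirely inside the database.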
Upvotes: 1