tost

Reputation: 43

Training XGBoost on a single-number metric

Suppose I am building an XGBoost model in Python (xgboost version 2.0.3) to predict a target variable in a stock market time series analysis (regression or classification does not matter here at all).

The target may be, for example, the next value in the time series, or a binary variable set to 1 if the next value is higher than the previous one and 0 otherwise.

To train the model it is possible to use, for example, MSE for the regression problem or 'binary:logistic' for the classification one.

After training, it is possible to backtest a strategy based on the model's output on the test set and compute the overall return.

My question is: using the xgboost scikit-learn interface, would it be possible to train the model on the performance metric used to backtest the strategy?

E.g.: to maximize the overall return on the training set, following the strategy rules.

On the xgboost library website, it is shown how to use a custom loss function for training the model:

import numpy as np
import xgboost as xgb
from typing import Tuple
from scipy.special import softmax

def softprob_obj(labels: np.ndarray, predt: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    # predt holds the raw (margin) scores, one column per class
    rows = labels.shape[0]
    classes = predt.shape[1]
    grad = np.zeros((rows, classes), dtype=float)
    hess = np.zeros((rows, classes), dtype=float)
    eps = 1e-6
    for r in range(predt.shape[0]):
        target = labels[r]
        p = softmax(predt[r, :])
        for c in range(predt.shape[1]):
            # gradient and hessian of the softmax cross-entropy loss
            g = p[c] - 1.0 if c == target else p[c]
            h = max((2.0 * p[c] * (1.0 - p[c])).item(), eps)
            grad[r, c] = g
            hess[r, c] = h

    # flatten into the (rows * classes, 1) layout expected by xgboost
    grad = grad.reshape((rows * classes, 1))
    hess = hess.reshape((rows * classes, 1))
    return grad, hess

clf = xgb.XGBClassifier(tree_method="hist", objective=softprob_obj)
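
For the regression case mentioned above, the analogous custom objective is even simpler. The following is only a minimal sketch (not taken from the xgboost docs) of a squared-error objective wired into the scikit-learn interface, to show where the gradient and hessian come from:

import numpy as np
import xgboost as xgb
from typing import Tuple

def squared_error_obj(y_true: np.ndarray, predt: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    # loss = 0.5 * (predt - y_true)**2, element-wise
    grad = predt - y_true          # first derivative w.r.t. predt
    hess = np.ones_like(predt)     # second derivative is constant
    return grad, hess

reg = xgb.XGBRegressor(tree_method="hist", objective=squared_error_obj)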

The objective function requires the computation of the gradient and the hessian.

Assuming a function defined as below:

def maximize_performance_metric(y_true: np.ndarray, y_pred: np.ndarray):
    # metric computation (e.g. overall return using y_pred)
    overall_return = get_overall_return(y_pred, real_prices, ...)  # overall_return is a float
    return grad, hess  # <-- how to obtain these from overall_return?

Is it possible to compute the gradient and the Hessian of the overall return and thus train the model using this custom loss function?

How can the function maximize_performance_metric() have access to the variable that contains real_prices ( needed for the overall return computation )?

Upvotes: 1

Views: 82

Answers (1)

user3666197

Reputation: 1

Hypothesis : "regression or classification does not matter here at all"

Well, it actually does.
( Prove me wrong if you can, yet some 20+ years of HFT & Quant strategy 24/7/365 massively-parallel backtesting factory design & technology operations speak here. I remain open to evidence of a working counter-example that disproves this claim by evidence, not by opinions. 20+ years of SDAAT-based dMM-tools can re-test any such presented counter-example candidate-claim. )

Q1 : "using (...) would it be possible to train the model on the performance metric used to backtest the strategy?"

Q2 : "Is it possible to compute the gradient and the Hessian of the overall return and thus train the model using this custom loss function?"

Q3 : "How can the function maximize_performance_metric() have access to the variable that contains real_prices ( needed for the overall return computation )?"

Let's start with some reality-checks, ok? How many "training"-examples can your current predictor get for its "training"-phase? That matters. See the Wassily HOEFFDING inequality below for more details.

import numpy as np

# HOEFFDING bound: N >= ln( 2 / ALPHA ) / ( 2 * EPS^2 ) training examples are needed
#                  so that, with probability at least ( 1 - ALPHA ), the empirical error
#                  stays within +/- EPS of the ground-truth error.

MASK = "We need at least {0:9d} training examples to drive a PREDICTOR-instance under Proba( {1: >5.3f} ) to deliver a prediction beyond +/- Epsilon-( {2: >5.3f} )-distance from y_GROUND_TRUTH"

for     EPS_NORMALISED_ERROR      in ( 1., 0.5, 0.25, 0.1, 0.05, 0.01, 0.005, 0.001 ):
    for ALPHA_PROBA_OF_BEYOND_EPS in ( 1., 0.5, 0.25, 0.1, 0.05, 0.01, 0.005, 0.001 ):
        print( MASK.format( int( np.log( 2. / ALPHA_PROBA_OF_BEYOND_EPS )
                                 / 2.
                                 / EPS_NORMALISED_ERROR**2
                                 ),                    # 0: N per Hoeffding
                            ALPHA_PROBA_OF_BEYOND_EPS, # 1: ALPHA
                            EPS_NORMALISED_ERROR       # 2: EPS
                            )
               )

Only if you have at least a few orders of magnitude more labeled data than HOEFFDING prescribes as a minimum for reaching a certainty better than ( 1 - ALPHA ) of not getting "answers" worse than EPSILON away from y_GROUND_TRUTH, for any and all of your already known, past examples, does it make sense to drill further.

Re-using famous Stephen LEACOCK's Juggins strategy, first published 100+ years ago, the lowest-hanging fruit is Q3.

A dirty, anti-Pythonista's move might be to declare such data as global for the cases where you need to assign new value(s) into it; otherwise, a plain reference to such a variable will get resolved through the enclosing namespaces and fetch the needed values into your maximize_performance_metric()-function.
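
A cleaner alternative, worth a minimal sketch, is to bind real_prices into the objective with a closure, so nothing has to live in the global namespace. The weighting scheme below is only a placeholder loss to show real_prices being reached, not the gradient of the backtested overall return ( see Q2 below ):

import numpy as np
import xgboost as xgb
from typing import Tuple

def make_objective(real_prices: np.ndarray):
    # real_prices is captured by the closure, so the objective can read it
    # without any global variable
    def objective(y_true: np.ndarray, y_pred: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        # placeholder loss: squared error weighted by the absolute price move;
        # assumes real_prices is at least as long as the training set
        moves = np.abs(np.diff(real_prices, prepend=real_prices[0]))[: len(y_true)]
        grad = moves * (y_pred - y_true)
        hess = np.maximum(moves, 1e-6)
        return grad, hess
    return objective

real_prices = np.cumsum(np.random.randn(10_000)) + 100.0   # placeholder price series
reg = xgb.XGBRegressor(tree_method="hist",
                       objective=make_objective(real_prices))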

No matter how simple and promising this might sound, things tend to get complicated - namely when a single Python-interpreter process gets turned into some form of multi-process ( not multi-threaded ) processing and/or even gets distributed over a pool of computing nodes for increased overall performance. There, this simplest way will not work. In such cases, there is still the chance to use data-on-demand pipelines ( ZeroMQ, nanomsg and other low-overhead, high-performance FinTech Quant tools can do this, no matter how distributed the setup is ).
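
As an illustration only ( the endpoint, the message format and the price server itself are hypothetical, not something xgboost provides ), a worker process could fetch the prices it needs over a plain ZeroMQ REQ/REP pipe instead of relying on a process-local global:

import numpy as np
import zmq

def fetch_real_prices(endpoint: str = "tcp://127.0.0.1:5555") -> np.ndarray:
    # ask a (hypothetical) price server for the full price vector
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REQ)
    sock.connect(endpoint)
    sock.send_string("GET real_prices")   # the request protocol is up to you
    raw = sock.recv()                     # server replies with raw float64 bytes
    sock.close()
    return np.frombuffer(raw, dtype=np.float64)

# inside the custom objective, running in any worker process:
# real_prices = fetch_real_prices()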

Your Q2 gets us into another corner. ML-tools, XGB-toys included, are nothing but trivial HyperParametrised-toys. Classical toys get their .train()-method "loaded" with a task to "minimise" some penalty-(or-inverted-reward)-function ( for a given set of HyperParameter values, be their actual values explicitly set or implicitly derived from (hopefully documented) defaults ).

Nothing else.

It is not fair to expect these "minimiser"-engines to do anything other than exactly this.

Given a set of "supervised" data ( X_[M_features,N_examples], y_GROUND_TRUTH_answers[N_examples] ), the .train()-method governs the ML-toy to adapt its internal state according to the "minimiser" penalty-function ( and that is why you get tasked to also deliver grad and hess for the cases where you opt not to re-use the built-in penalty functions, for which these are known and implemented; that duty and its implementation are outsourced onto you, to be delivered via the local grad, hess values returned per call ).

Having done so, the ML-toy has reached no other state than the one that receives the lowest possible amount of penalty ... for the PAST market data ... Here no one ought to be surprised, as we did not ask the ML-toy for anything else but this very goal: set yourself ( internally, within your implemented, so hardwired, numerical logic ) so as to receive a minimum amount of accumulated penalties for those examples we gave you. In other words, any other internal re-configuration of the hardwired logic would deliver a higher penalty for the same data than the one the internal "minimiser" self-adapted the ML-toy into. That is great, no doubt; massive amounts of computing power were consumed to reach such a "minimum"-penalty goal, yet such a state is "minimal" only ... for the PAST market data ... which, as we all know beforehand, will repeat with a probability close to zero.
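
To make the grad/hess duty concrete: the overall return of a long/flat strategy is a step function of the predictions ( a trade is either taken or not ), so its gradient with respect to the raw scores is zero almost everywhere and undefined at the decision thresholds. What can be delivered instead is a smooth surrogate. A minimal sketch, assuming a binary up/down target and a hypothetical per-row next-period return vector bar_returns, is a return-weighted logistic loss; it is not the backtested overall return itself:

import numpy as np
import xgboost as xgb
from typing import Tuple

def make_return_weighted_logloss(bar_returns: np.ndarray):
    # bar_returns[i] = next-period return of the asset for training row i (hypothetical input)
    weights = np.abs(bar_returns)                     # bigger moves weigh more in the penalty

    def objective(y_true: np.ndarray, raw_score: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        w = weights[: len(y_true)]
        p = 1.0 / (1.0 + np.exp(-raw_score))          # sigmoid of the raw margin
        grad = w * (p - y_true)                       # d/d(raw) of the weighted logloss
        hess = np.maximum(w * p * (1.0 - p), 1e-6)    # 2nd derivative, floored for stability
        return grad, hess

    return objective

bar_returns = np.random.randn(10_000) * 0.01          # placeholder; use real next-period returns
clf = xgb.XGBClassifier(tree_method="hist",
                        objective=make_return_weighted_logloss(bar_returns))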

This is why ML-practitioners carefully distinguish the DATA-driven .train()-phase, for which a fixed, pre-programmed logic is hardwired, from HyperParameters-driven "minimiser"-tuning, and from the actual ways ML-toys get improved results in their "ability to (somehow) generalise" ( to escape from the cage of a "training"-DATA-dictated, dangerously local, never re-appearing minimum ) by means of Quant Feature Engineering etc.

Q1 is the most complicated one, if it ought to be answered honestly and seriously. At first sight, as said above, it seems nonsense not to force the ML-toy, by the use of the .train()-method, to internally self-tune on the very Quant-fair performance metric, doesn't it?

Non-Quants often start to understand this way too late, while academic papers carry the same kind of nonsense all too often, as their authors do not bear the adverse costs of the losses, do they?

Why is this claimed to be the most complicated? Because of heteroskedasticity.

PROOF:
Imagine for a moment that we have already reached a point where we have an operational Dream-Machine system that can trade our well-prepared, always-winning strategy ( as already thoroughly Quant-proven in all our backtesting scenarios, under indeed all possible market conditions ).

What will happen with such a cool Dream-Machine system on the real Markets?

Theory, nose-dived into temporal-episode-data-minimised, local-only, non-heteroskedastic ML-toys, will answer: "It will apply our best-ever trading strategy, which will always be better than any other, and we will beat the Market." - it cannot answer anything else, as the ML-"minimiser"-toys indeed do nothing else but this.

That would also mean that such a Dream-Machine strategy would progress forever, gaining and accumulating profits that grow, in principle, infinitely large over time.

We know that this cannot practically materialise, if for nothing else, then because of the overall global limit of all money ( funds ) operating in the Financial Markets.

This proof by contradiction ( Q.E.D. ) suffices to refute the hypothesis that such a tool, knowingly trained on a static set of past data ( no matter how large or small its local context ), having no internal adaptation to reflect, at least ex post, any evolution of the external exosystem, let alone any robustness to operate inside a heteroskedastic, heavily constrained, fast self-evolving and self-modifying exosystem, can ever meet the initial hypothesis in the real world of Financial Markets.

Upvotes: 0
