Kspr
Kspr

Reputation: 685

Is there a way to get the probability of a prediction using XGBoostRegressor?

I have built a XGBoostRegressor model using around 200 categorical features predicting a countinous time variable.

But I would want to get both the actual prediction and the probability of that prediction as output. Is there any way to get this from the XGBoostRegressor model?

So I both want \hat{y} and P(Y|X) as output. Any idea how to do this?

Upvotes: 2

Views: 3247

Answers (3)

gio.visani
gio.visani

Reputation: 21

I suggest you to look into Ngboost (essentially a wrapper of Xgboost which provides eventually a probabilistic model. Here you can find slides on the Ngboost functioning and the seminal Ngboost paper.

The basic idea is to assume a specific distribution for $P(Y|X=x)$ (by default is the Gaussian distribution) and fit an Xgboost model to estimate the best parameters of the distribution (for the Gaussian $\mu$ and $\sigma$. The model will split the variables' space into different regions with different distributions, i.e. same family (eg. Gaussian) but different parameters.

After training the model, you're provided with the method '''pred_dist''' which returns the estimated distribution $P(Y|X=x)$ for a given set of values $x$

Upvotes: 2

Sysmetryx
Sysmetryx

Reputation: 33

As mentioned before, there is no probability associated with regression.

However, you could probably add a confidence interval on that regression, to see whether or not your regression can be trusted.

One thing to note though, is that the variance might not be the same along the data. Let's assume that you study a time based phenomenon. Specifically, you have the temperature (y) after (x) time (in sec for instance) inside an oven. At x = 0s it is at 20°C, and you start heating it, and want to know the evolution in order to predict the temperature after x seconds. The variance could be the same after 20 seconds and after 5 minutes, or be completely different. This is called heteroscedasticity.

If you want to use a confidence interval, you probably want to make sure that you took care of heteroscedasticity, so your interval is the same for all the data.

You can probably try to get the distribution of your known outputs and compare the prediction on that curve, and check the pvalue. But that would only give you a measure of how realistic it is to get that output, without taking the input into consideration. If you know your inputs/outputs are in a specific interval, this could work.

  • EDIT This is how I would do it. Obviously the outputs are your real outputs. import numpy as np import matplotlib.pyplot as plt from scipy import integrate from scipy.interpolate import interp1d N = 1000 # The number of sample mean = 0 std = 1 outputs = np.random.normal(loc=mean, scale=std, size=N) # We want to get a normed histogram (since this is PDF, if we integrate # it must be equal to 1) nbins = N / 10 n = int(N / nbins) p, x = np.histogram(outputs, bins=n, normed=True) plt.hist(outputs, bins=n, normed=True) x = x[:-1] + (x[ 1] - x[0])/2 # converting bin edges to centers # Now we want to interpolate : # f = CubicSpline(x=x, y=p, bc_type='not-a-knot') f = interp1d(x=x, y=p, kind='quadratic', fill_value='extrapolate') x = np.linspace(-2.9*std, 2.9*std, 10000) plt.plot(x, f(x)) plt.show() # To check : area = integrate.quad(f, x[0], x[-1]) print(area) # (should be close to 1)

This is what I get from the method above.

Now, the interpolate method is not great for outliers. if a predicted data is extremely far (more than 3 times the std) from your distribution, it wont work. Other than that, you can now use the PDF to get meaningful results.

It is not perfect, but it is the best I came up with in that time. I'm sure there are some better ways to do it. If your data follow a normal law, it becomes trivial.

Upvotes: 1

JAbr
JAbr

Reputation: 332

There is no probability in regression, In regression the only output you will get is a predicted value thats why it is called regression, so for any regressor probability of a prediction is not possible. Its only there in classification.

Upvotes: 2

Related Questions