Reputation: 218
Suppose a user gives some percentiles of a distribution, as below, and we are trying to find the parameters of that distribution.
# a list of (p,x) tuples, where P(X<x)=p
percentiles = [(0.2,8),(0.4,12),(0.5,16),(0.9,30)]
The user specifies the family of the distribution (e.g. Normal). When there are more than two percentiles, the system of equations is over-determined, so we want to find the parameters that best fit the input in a least-squares sense.
I'm having trouble implementing this. In the minimal example below, curve_fit just returns the default value of 1 for both parameters. What am I doing wrong?
from scipy import stats
from scipy import optimize

# a list of (p,x) tuples, where P(X<x)=p
percentiles = [(0.2,8),(0.4,12),(0.5,16),(0.9,30)]

fit = optimize.curve_fit(
    lambda x, mu, sigma: stats.norm(mu, sigma).cdf(x),
    xdata=[x[1] for x in percentiles],
    ydata=[x[0] for x in percentiles])
print(fit[0])
Edit: JohanC pointed out that the initial guesses are important. Here's the (crude) method I used to derive an initial guess for the parameters from the user's input:
import numpy as np

def percentiles_to_list(percentiles):
    # Approximate the distribution by repeating each percentile value
    # once per percentage point of probability mass assigned to it.
    out = []
    i = 1
    c = 1
    for p, q in percentiles:
        if c == len(percentiles):
            number_to_append = int(100 - i)
        else:
            number_to_append = int(p*100 - i)
        out += [q]*number_to_append
        i = p*100
        c += 1
    return out

def initial_guess(percentiles):
    lis = percentiles_to_list(percentiles)
    mean = np.mean(lis)
    stdev = np.std(lis)
    return mean, stdev
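For completeness, here is a minimal sketch (not part of the original post) of feeding that guess into curve_fit via p0, assuming the percentiles list and the imports above are in scope:

# Sketch: reuse the crude guess as the starting point for curve_fit.
mu0, sigma0 = initial_guess(percentiles)
fit_params, fit_cov = optimize.curve_fit(
    lambda x, mu, sigma: stats.norm(mu, sigma).cdf(x),
    xdata=[x for p, x in percentiles],
    ydata=[p for p, x in percentiles],
    p0=[mu0, sigma0])
print(fit_params)  # fitted mu and sigma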
Upvotes: 0
Views: 779
Reputation: 80509
To obtain a good fit, initial guesses for curve_fit are important. Here, the given 50% percentile is a perfect initial guess for mu. For sigma we can take any reasonable initial guess. The default guesses are all 1s, which is too far off in this case. Apart from (or instead of) an initial guess, parameter bounds can also be set for curve_fit.
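As a minimal sketch (the bound values below are illustrative assumptions, not part of this answer), an initial guess and bounds can be combined like this, reusing the percentiles list from the question:

# Sketch: initial guess plus parameter bounds for curve_fit.
# mu is assumed to lie between the smallest and largest given values,
# and sigma is constrained to be positive.
fit_bounded, _ = optimize.curve_fit(
    lambda x, mu, sigma: stats.norm(mu, sigma).cdf(x),
    xdata=[x for p, x in percentiles],
    ydata=[p for p, x in percentiles],
    p0=[16, 8],
    bounds=([8, 1e-6], [30, 100]))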
Note that because the cdf is strictly monotonic, we could also swap the roles of x and y (i.e. fit the ppf instead of the cdf) to check whether that gives a more satisfactory fit.
The following code tries both fits and shows the results graphically. The two fits coincide quite closely.
from matplotlib import pyplot as plt
import numpy as np
from scipy import stats
from scipy import optimize

# a list of (p, x) tuples, where P(X<x)=p
percentiles = [(0.2, 8), (0.4, 12), (0.5, 16), (0.9, 30)]

# fit the ppf: probabilities as x, percentile values as y
fit_params_ppf, _fit_covariances = optimize.curve_fit(
    lambda x, mu, sigma: stats.norm(mu, sigma).ppf(x),
    xdata=[percent for percent, percentile in percentiles],
    ydata=[percentile for percent, percentile in percentiles],
    p0=[16, 8])

plt.scatter([percentile for percent, percentile in percentiles],
            [percent for percent, percentile in percentiles],
            label='given percentiles')
xs = np.linspace(stats.norm(*fit_params_ppf).ppf(0.001),
                 stats.norm(*fit_params_ppf).ppf(0.999), 500)
plt.plot(xs, stats.norm(*fit_params_ppf).cdf(xs), 'b--',
         label=f'fit ppf: $\\mu={fit_params_ppf[0]:.2f}, \\sigma={fit_params_ppf[1]:.2f}$')

# fit the cdf: percentile values as x, probabilities as y
fit_params_cdf, _fit_covariances = optimize.curve_fit(
    lambda x, mu, sigma: stats.norm(mu, sigma).cdf(x),
    xdata=[percentile for percent, percentile in percentiles],
    ydata=[percent for percent, percentile in percentiles],
    p0=[16, 8])
plt.plot(xs, stats.norm(*fit_params_cdf).cdf(xs), 'r:',
         label=f'fit cdf: $\\mu={fit_params_cdf[0]:.2f}, \\sigma={fit_params_cdf[1]:.2f}$')

plt.legend()
plt.show()
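As a design note, fitting the ppf treats the given probabilities as the independent variable and minimizes squared residuals in the quantile values, while fitting the cdf minimizes them in the probabilities; plotting both makes it easy to see whether the choice matters for the given data.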
Upvotes: 1