tmkadamcz

Reputation: 218

Least squares fit of cumulative density function to some quantiles

Suppose a user gives some percentiles of a distribution like this, and we want to find the parameters of the distribution.

# a list of (p,x) tuples, where P(X<x)=p
percentiles = [(0.2,8),(0.4,12),(0.5,16),(0.9,30)]

The user specifies the family of the distribution (e.g. Normal). When there are more than 2 percentiles, the equation system is over-determined, so we'll want to find parameters that fit the input best, based on least squares.

I'm having trouble implementing this. In this minimal example below, curve_fit just returns the default value of 1 for both parameters. What am I doing wrong?

from scipy import stats
from scipy import optimize

# a list of (p,x) tuples, where P(X<x)=p
percentiles = [(0.2,8),(0.4,12),(0.5,16),(0.9,30)]

fit = optimize.curve_fit(
            lambda x,mu,sigma: stats.norm(mu,sigma).cdf(x),
            xdata=[x[1] for x in percentiles],
            ydata=[x[0] for x in percentiles])
print(fit[0])

Edit: JohanC pointed out that the initial guesses are important. Here's the (crude) method I used to get an initial guess for the parameters from the user's input:

import numpy as np

def percentiles_to_list(percentiles):
    # Expand the percentiles into a crude pseudo-sample: repeat each
    # quantile value once per percentage point it covers.
    out = []
    i = 1
    c = 1
    for p, q in percentiles:
        if c == len(percentiles):
            number_to_append = int(100 - i)
        else:
            number_to_append = int(p * 100 - i)
        out += [q] * number_to_append
        i = p * 100
        c += 1
    return out

def initial_guess(percentiles):
    lis = percentiles_to_list(percentiles)
    mean = np.mean(lis)
    stdev = np.std(lis)
    return mean, stdev
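(For a Normal distribution specifically, an alternative to the pseudo-sample approach above is available: since x = mu + sigma * Φ⁻¹(p), the parameters are linear in the quantile relation, so a closed-form least-squares guess can be computed directly. This is a sketch of that idea, not part of the original post.)

```python
import numpy as np
from scipy import stats

percentiles = [(0.2, 8), (0.4, 12), (0.5, 16), (0.9, 30)]

# For a Normal, x = mu + sigma * z with z = norm.ppf(p),
# so an ordinary linear least-squares fit recovers both parameters.
z = stats.norm.ppf([p for p, x in percentiles])
x = np.array([x for p, x in percentiles])
sigma0, mu0 = np.polyfit(z, x, 1)  # slope = sigma, intercept = mu
print(mu0, sigma0)
```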

Upvotes: 0

Views: 779

Answers (1)

JohanC

Reputation: 80509

To obtain a good fit, initial guesses for curve_fit are important. Here, the given 50% percentile is a perfect initial guess for mu. For sigma we can take any reasonable initial guess. The default guesses are all 1s, which are too far off in this case. Besides (or instead of) an initial guess, parameter bounds can also be set for curve_fit.
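As a minimal sketch of that point, here is the asker's original cdf fit with a p0 added (median as the guess for mu, a rough spread for sigma) and a bound keeping sigma strictly positive; the specific guess values are illustrative:

```python
import numpy as np
from scipy import stats, optimize

percentiles = [(0.2, 8), (0.4, 12), (0.5, 16), (0.9, 30)]
xdata = [x for p, x in percentiles]
ydata = [p for p, x in percentiles]

# p0 moves the optimizer off the default all-ones start;
# bounds keep sigma strictly positive.
popt, _ = optimize.curve_fit(
    lambda x, mu, sigma: stats.norm(mu, sigma).cdf(x),
    xdata, ydata,
    p0=[16, 8],
    bounds=([-np.inf, 1e-9], [np.inf, np.inf]))
print(popt)  # fitted [mu, sigma]
```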

Note that because the cdf is strictly monotonic, we can switch the roles of x and y (fitting the ppf instead of the cdf) to check whether that gives a more satisfactory fit.

The following code tries both fits and shows the result graphically. Both coincide quite closely.

from matplotlib import pyplot as plt
import numpy as np
from scipy import stats
from scipy import optimize

# a list of (p,x) tuples, where P(X<x)=p
percentiles = [(0.2, 8), (0.4, 12), (0.5, 16), (0.9, 30)]

fit_params_ppf, _fit_covariances = optimize.curve_fit(lambda x, mu, sigma: stats.norm(mu, sigma).ppf(x),
                                                      xdata=[percent for percent, percentile in percentiles],
                                                      ydata=[percentile for percent, percentile in percentiles],
                                                      p0=[16, 8])
plt.scatter([percentile for percent, percentile in percentiles], [percent for percent, percentile in percentiles],
            label='given percentiles')
xs = np.linspace(stats.norm(*fit_params_ppf).ppf(0.001), stats.norm(*fit_params_ppf).ppf(0.999), 500)
plt.plot(xs, stats.norm(*fit_params_ppf).cdf(xs), 'b--',
         label=f'fit ppf: $\\mu={fit_params_ppf[0]:.2f}, \\sigma={fit_params_ppf[1]:.2f}$')

fit_params_cdf, _fit_covariances = optimize.curve_fit(lambda x, mu, sigma: stats.norm(mu, sigma).cdf(x),
                                                      xdata=[percentile for percent, percentile in percentiles],
                                                      ydata=[percent for percent, percentile in percentiles],
                                                      p0=[16, 8])
plt.plot(xs, stats.norm(*fit_params_cdf).cdf(xs), 'r:',
         label=f'fit cdf: $\\mu={fit_params_cdf[0]:.2f}, \\sigma={fit_params_cdf[1]:.2f}$')

plt.legend()
plt.show()

(resulting plot: the given percentiles with both fitted cdf curves)

Upvotes: 1
