user3356689
user3356689

Reputation: 55

How to do data fitting to find the distribution of given data

I need to do data fitting to find the distribution of a given data.

I need to find the pdf function of the distribution.

I can use data fitting functions in matlab and python.

It looks like a truncated gamma.

But, how to find the parameters of the distribution ?

What if the data cannot fit the truncated gamma well ?

The QQ-plot (qunatile-quantile) show that it is not a good fit for truncated gamma.

How to find the distribution parameters such as alpha (shape), beta (scale) for the truncated gamma ?

If data fitting cannot work here, what other methods I can use for that ?

Any help would be appreciated.

Upvotes: 0

Views: 1339

Answers (3)

erdogant
erdogant

Reputation: 1694

The distfit library can all the points you are asking for. It searches across 89 theoretical distributions to find the best with the loc, scale arg parameters. Example:

I need to do data fitting to find the distribution of a given data.

pip install distfit

from distfit import distfit
import numpy as np

# Create dataset
X = np.random.normal(0, 2, 1000)

# Default method is parametric.
dfit = distfit(distr='popular') # 'all' for all 89 or specify your own list.
# Search for best theoretical fit on your empirical data
results = dfit.fit_transform(X)

# [distfit] >INFO> fit
# [distfit] >INFO> transform
# [distfit] >INFO> [norm      ] [0.00 sec] [RSS: 0.00221125] [loc=0.002 scale=1.902]
# [distfit] >INFO> [expon     ] [0.00 sec] [RSS: 0.189094] [loc=-5.527 scale=5.529]
# [distfit] >INFO> [pareto    ] [0.00 sec] [RSS: 0.189094] [loc=-536870917.527 scale=536870912.000]
# [distfit] >INFO> [dweibull  ] [0.03 sec] [RSS: 0.00285214] [loc=0.005 scale=1.667]
# [distfit] >INFO> [t         ] [0.18 sec] [RSS: 0.00221128] [loc=0.002 scale=1.902]
# [distfit] >INFO> [genextreme] [0.08 sec] [RSS: 0.00295224] [loc=-0.699 scale=1.892]
# [distfit] >INFO> [gamma     ] [0.02 sec] [RSS: 0.00221712] [loc=-2156.094 scale=0.002]
# [distfit] >INFO> [lognorm   ] [0.13 sec] [RSS: 0.00235342] [loc=-149.018 scale=149.010]
# [distfit] >INFO> [beta      ] [0.03 sec] [RSS: 0.00208639] [loc=-11.915 scale=24.259]
# [distfit] >INFO> [uniform   ] [0.00 sec] [RSS: 0.122291] [loc=-5.527 scale=11.639]
# [distfit] >INFO> [loggamma  ] [0.06 sec] [RSS: 0.00209695] [loc=-376.616 scale=55.771]
# [distfit] >INFO> Compute confidence intervals [parametric]

fig, ax = plt.subplots(1, 2, figsize=(20, 10))
dfit.plot(chart='PDF', ax=ax[0])
dfit.plot(chart='CDF', ax=ax[1])

PDF/CDF plot

The best fit can be found in the results which will answer this question:

I need to find the pdf function of the distribution.

print(results['model'])

{'distr': <scipy.stats._continuous_distns.beta_gen at 0x155d67e35b0>,
 'stats': 'RSS',
 'params': (19.490647075756037,
  20.18413144061353,
  -11.915134641255602,
  24.25907054997436),
 'name': 'beta',
 'model': <scipy.stats._distn_infrastructure.rv_continuous_frozen at 0x15583c00760>,
 'score': 0.002086393123419647,
 'loc': -11.915134641255602,
 'scale': 24.25907054997436,
 'arg': (19.490647075756037, 20.18413144061353),
 'CII_min_alpha': -3.124111310171669,
 'CII_max_alpha': 3.141196658807374}

But, how to find the parameters of the distribution ?

results['model']['params']

What if the data cannot fit the truncated gamma well ?

# Take one of the other fitting models.
results['summary'][['distr', 'score']]

# Plot results
dfit.plot_summary()

Fit results

The QQ-plot (qunatile-quantile) show that it is not a good fit for truncated gamma.

# Make qqplot for the best fit.
dfit.qqplot(X)

QQplot

# Inspect all other fits
dfit.qqplot(X, n_top=10)

QQplot other fits

If data fitting cannot work here, what other methods I can use for that ?

You can use non-parametric methods:

# Quantile method
dfit = distfit(method='quantile')

# Percentile method
dfit = distfit(method='percentile')

Any help would be appreciated.

Disclaimer: I am also the author of this repo.

Upvotes: 1

Jean A.
Jean A.

Reputation: 301

Maybe this post can help.

I presented an example of how to find the best distribution according to BIC criterium using OpenTURNS.

You define a list 'Distribution Factories' tested_distributions = [ot.WeibullMaxFactory(), ot.NormalFactory(), ot.UniformFactory()]

then you call BestModelBIC to find the best fitting best_model, best_bic = ot.FittingTest.BestModelBIC(sample, tested_distributions)

Currently, you can choose among 30 available 'Factories' in OpenTURNS (see below). TruncatedNormalFactory is available but not yet TruncatedBetaFactory

print(ot.DistributionFactory.GetContinuousUniVariateFactories())

[Out]: 
[ArcsineFactory,
BetaFactory, 
BurrFactory, 
ChiFactory, 
ChiSquareFactory, 
DirichletFactory,
ExponentialFactory,
FisherSnedecorFactory,
FrechetFactory,
GammaFactory,
GeneralizedParetoFactory,
GumbelFactory,HistogramFactory,
InverseNormalFactory,
LaplaceFactory,LogisticFactory,
LogNormalFactory,
LogUniformFactory,
MeixnerDistributionFactory,
NormalFactory,
ParetoFactory,
RayleighFactory,
RiceFactory,
StudentFactory,
TrapezoidalFactory,
TriangularFactory,
TruncatedNormalFactory,
UniformFactory,
WeibullMaxFactory,
WeibullMinFactory]
#30

Upvotes: 0

pjs
pjs

Reputation: 19853

Check out allfitdist in Matlab.

Alternatively, consider specialty packages such as ExpertFit or EasyFit. The JMP statistical software also has a fairly easy to use distribution fitting option. All of these will evaluate goodness-of-fit criteria such as Cramer-von Mises and log likelihood estimates.

Once you've picked the functional form of the distribution, parameter values are usually estimated by maximum likelihood estimators or method of moments.

If you're planning on using the results in a simulation of some sort, you might consider just bootstrapping your sample rather than distribution fitting. Yet another option if simulating would be to run a designed experiment where you vary the distribution choice and see if alternatives have a significant impact on your results before worrying too much about fitting just the right distribution.

Upvotes: 1

Related Questions