negfrequency

Reputation: 1897

Creating a best fit probability distribution from pdf sample coordinates with scipy

Problem: I have data points representing coordinates sampled from a probability distribution (in this case we will assume a discrete probability distribution). We are essentially forming a 'best fit of a pdf' from pdf data here.

Given: sample coordinates of a pdf, and the type of pdf to fit to them (e.g. lognorm)

Return: ideally the pdf parameters, or alternatively the coordinates of the best-fit distribution.

I have not found an existing question on Stack Overflow asking/answering this, and I understand it may be poor practice. It seems that scipy expects the original raw data to build the pdf parameters from, not sample coordinates taken from a pdf.

I have the following vectors:

x = list(range(40))

y = 
[0.032935611986072325,
 0.15399668951796566,
 0.19217568076280733,
 0.16189644686218774,
 0.11504756998080325,
 0.09474568682103104,
 0.08971162676825704,
 0.06198299715985481,
 0.04408241680044377,
 0.026817519111333753,
 0.013562814925870696,
 0.007007365243147507,
 0.003909173588759217,
 0.0015053452905258473,
 0.00037481359597322736,
 0.0001378624720821066,
 5.734365756863486e-05,
 2.9711739672867803e-05,
 8.022169711674307e-06,
 5.942347934573561e-06,
 2.228380475465085e-06,
 3.7139674591084754e-06,
 8.913521901860341e-07,
 8.913521901860341e-07,
 5.94234793457356e-07,
 2.97117396728678e-07,
 2.97117396728678e-07,
 2.97117396728678e-07,
 1.48558698364339e-07,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0]

Upvotes: 0

Views: 761

Answers (2)

Jean A.

Reputation: 301

I think your data represents a pdf {x, y = pdf(x)}, since sum(y) = 1. When we plot your data with a slight correction, x = list(range(39)) (so that x and y have the same length), we get a curve that looks like a lognormal:

import matplotlib.pyplot as plt

x = list(range(39))
plt.plot(x, y)
plt.show()

(Plot of y against x; the curve resembles a lognormal pdf.)

One trick you could use to avoid optimisation algorithms is to transform your data into a sample, since each y[i] is proportional to the frequency of x[i]. In other words, if you want a 'perfect' sample S of size N, each x[i] will appear N * y[i] times.

import numpy as np

N = 20_000  # note: 20.000 is the float 20.0 in Python, which would make every count 0
n_times = [int(y[i] * N) for i in range(len(y))]  # each x[i] repeated about N * y[i] times
S = np.repeat(x, n_times)
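
Since the question asked for scipy: the same synthetic sample S can also be passed to scipy's maximum-likelihood fitter, which returns the (shape, loc, scale) parametrisation directly. A minimal sketch:

from scipy import stats

# fit a lognormal to the synthetic sample by maximum likelihood
s, loc, scale = stats.lognorm.fit(S)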

All that remains is to fit a LogNormal distribution to S. Personally, I am used to the OpenTURNS library. You just need to format S as an ot.Sample, reshaping it into N points of dimension 1:

import openturns as ot

sample = ot.Sample([[p] for p in S])
fitdist = ot.LogNormalFactory().build(sample)

fitdist is an ot.Distribution; you can print it to see its parameters:

print(fitdist)
>>> LogNormal(muLog = 1.62208, sigmaLog = 0.45679, gamma = -1.79583)

or plot both curves using the built-in fitdist.computePDF method, which takes an ot.Sample as its argument:

plt.plot(x, y)
plt.plot(x, fitdist.computePDF(ot.Sample([[p] for p in x])))

(Plot of the original data together with the fitted lognormal pdf.)

Upvotes: 0

Julius

Reputation: 765

Calling your PDF f(x):

If your data really represents {x, f(x)}, then you could simply optimize for the parameters of f using e.g. scipy.optimize.leastsq: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.leastsq.html#scipy.optimize.leastsq
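
A minimal sketch of that route, assuming the target family is lognormal (scipy.stats.lognorm); the initial guess p0 is illustrative and may need adjusting:

import numpy as np
from scipy import optimize, stats

# x and y as defined in the question (with x = list(range(39)) so lengths match)
def residuals(params, x, y):
    s, loc, scale = params
    return stats.lognorm.pdf(x, s, loc=loc, scale=scale) - y

p0 = [0.5, -1.0, 5.0]  # illustrative starting point for (s, loc, scale)
(s, loc, scale), ier = optimize.leastsq(residuals, p0, args=(np.asarray(x), np.asarray(y)))
print(s, loc, scale)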

If, on the other hand, your data are samples from the probability distribution, i.e. your data look like {x} where each x is drawn with probability f(x), then you should try Markov chain Monte Carlo to estimate f (see the sketch after these links). There are several choices for Python:

https://pystan.readthedocs.io/en/latest/

http://docs.pymc.io/notebooks/getting_started.html#Model-fitting
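
A hedged sketch of the MCMC route with PyMC3, assuming the samples sit in a 1-D array samples and that the target family is lognormal; the priors here are illustrative only:

import pymc3 as pm

# samples: 1-D array of draws from the unknown distribution (assumed)
with pm.Model():
    mu = pm.Normal("mu", 0.0, 10.0)        # weak prior on the log-scale mean
    sigma = pm.HalfNormal("sigma", 5.0)    # weak prior on the log-scale spread
    pm.Lognormal("obs", mu=mu, sigma=sigma, observed=samples)
    trace = pm.sample(1000)                # run the sampler

pm.summary(trace)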

Upvotes: 1
