Reputation: 1897
Problem: I have data points indicating coordinates sampled from a probability distribution (in this case we will assume a discrete probability distribution function). We are essentially forming a 'best fit of a pdf' from pdf data here.
Given: sample coordinates of the pdf and the type of distribution to fit to it (e.g. lognorm)
Return: Ideally the pdf parameters, or, alternatively the coordinates of the best fit distribution.
I have not found a question on Stack Overflow with this question/answer, and I understand it may be poor practice. It seems that scipy explicitly wants the original sample data to build the pdf parameters from, not coordinates sampled from a pdf.
I have vectors whereby:
x = list(range(40))
y =
[0.032935611986072325,
0.15399668951796566,
0.19217568076280733,
0.16189644686218774,
0.11504756998080325,
0.09474568682103104,
0.08971162676825704,
0.06198299715985481,
0.04408241680044377,
0.026817519111333753,
0.013562814925870696,
0.007007365243147507,
0.003909173588759217,
0.0015053452905258473,
0.00037481359597322736,
0.0001378624720821066,
5.734365756863486e-05,
2.9711739672867803e-05,
8.022169711674307e-06,
5.942347934573561e-06,
2.228380475465085e-06,
3.7139674591084754e-06,
8.913521901860341e-07,
8.913521901860341e-07,
5.94234793457356e-07,
2.97117396728678e-07,
2.97117396728678e-07,
2.97117396728678e-07,
1.48558698364339e-07,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0]
Upvotes: 0
Views: 761
Reputation: 301
I think your data represents a pdf {x, y = pdf(x)}, since sum(y) = 1. When we plot your data with a slight correction, x = list(range(39)) (your y has 39 values, not 40), we get a curve similar to a lognormal (?).
import matplotlib.pyplot as plt
x = list(range(39))
plt.plot(x, y)
One trick you could use to avoid optimisation algorithms is to transform your data into a sample, since each y[i] is proportional to the frequency of x[i]. In other words, if you want a 'perfect' sample S of size N, each x[i] will appear N * y[i] times.
import numpy as np
N = 20_000  # note: 20.000 would be the float 20.0 in Python
n_times = [int(y[i] * N) for i in range(len(y))]
S = np.repeat(x, n_times)
All that remains to be done is to fit a LogNormal distribution to S. Personally, I am used to the OpenTURNS library. You just need to format S as an ot.Sample by reshaping it into N points of dimension 1:
import openturns as ot
sample = ot.Sample([[p] for p in S])
fitdist = ot.LogNormalFactory().build(sample)
fitdist is an ot.Distribution; you can print it to see its parameters:
print(fitdist)
>>> LogNormal(muLog = 1.62208, sigmaLog = 0.45679, gamma = -1.79583)
or plot both curves using the fitdist.computePDF built-in method, which takes an ot.Sample as argument:
plt.plot(x, y)
plt.plot(x, fitdist.computePDF(ot.Sample([[p] for p in x])))
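If you would rather stay within scipy, the same sample-reconstruction trick can be paired with scipy.stats.lognorm.fit. A minimal sketch, using a synthetic discretised lognormal in place of the question's y values (the generating parameters s=0.5, scale=5.0 are assumptions chosen for illustration):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the question's data: a lognormal pdf
# evaluated on integer coordinates and normalised to sum to 1.
x = np.arange(39)
y = stats.lognorm.pdf(x, s=0.5, scale=5.0)
y = y / y.sum()

# Rebuild an approximate sample: x[i] repeated ~N * y[i] times.
N = 20_000
counts = np.rint(y * N).astype(int)
S = np.repeat(x, counts)

# Fit with scipy; floc=0 pins the location parameter at zero.
shape, loc, scale = stats.lognorm.fit(S, floc=0)
```

The recovered shape and scale land close to the generating parameters; the integer grid introduces a small discretisation bias.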
Upvotes: 0
Reputation: 765
Calling your PDF f(x):
If your data really represents {x, f(x)}, then you could try simply optimizing for the parameters of f using e.g. https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.leastsq.html#scipy.optimize.leastsq
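For that first case, scipy.optimize.curve_fit (which wraps leastsq) can fit the pdf parameters directly to the (x, y) pairs. A minimal sketch, assuming a synthetic lognormal pdf as the data with free shape and scale parameters (the starting guess p0 is an assumption, and a reasonable one matters for convergence):

```python
import numpy as np
from scipy import stats, optimize

# Synthetic {x, f(x)} data: a lognormal pdf sampled on a grid,
# standing in for the question's y values.
x = np.arange(1, 40, dtype=float)
y = stats.lognorm.pdf(x, s=0.5, scale=5.0)

# Model to fit: a lognormal pdf with free shape and scale.
def model(x, s, scale):
    return stats.lognorm.pdf(x, s=s, scale=scale)

popt, _ = optimize.curve_fit(model, x, y, p0=[1.0, 3.0])
s_hat, scale_hat = popt
```

Because the data here is exactly the model, the fit recovers the generating parameters almost exactly; on noisy data the residuals would be nonzero.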
If, on the other hand, your data are samples from the probability distribution, i.e. your data looks like {x} but each x is chosen with probability f(x), then you should try Markov Chain Monte Carlo to estimate f. There are several choices for Python:
https://pystan.readthedocs.io/en/latest/
http://docs.pymc.io/notebooks/getting_started.html#Model-fitting
Upvotes: 1