Reputation: 6213
Problem: I want to fit empirical data to a bimodal normal distribution from which I know from the physical context the distance of the peaks (fixed) and also that both peaks must have the same standard deviation.
I was trying to create an own distribution with scipy.stats.rv_continous
(see code below), but the parameters are always fitted to 1. Does someone understand what's going on, or can point me to a different approach to solve the problem?
Details: I avoided the loc
and scale
parameters and implemented them as m
and s
directly into the _pdf
-method since the peak distance delta
shall not be affected by scale
. To compensate that, I fixed them to floc=0
and fscale=1
in the fit
-method and actually want fit-parameters for m
, s
and the weights of the peaks w
What I expect in the sample data is a distribution with peaks around x=-450
and x=450
(=> m=0
). The stdev s
should be around 100 or 200, but not 1.0, and the weight w
should be approx. 0.5
from __future__ import division
from scipy.stats import rv_continuous
import numpy as np
class norm2_gen(rv_continuous):
def _argcheck(self, *args):
return True
def _pdf(self, x, m, s, w, delta):
return np.exp(-(x-m+delta/2)**2 / (2. * s**2)) / np.sqrt(2. * np.pi * s**2) * w + \
np.exp(-(x-m-delta/2)**2 / (2. * s**2)) / np.sqrt(2. * np.pi * s**2) * (1 - w)
norm2 = norm2_gen(name='norm2')
data = [487.0, -325.5, -159.0, 326.5, 538.0, 552.0, 563.0, -156.0, 545.5, 341.0, 530.0, -156.0, 473.0, 328.0, -319.5, -287.0, -294.5, 153.5, -512.0, 386.0, -129.0, -432.5, -382.0, -346.5, 349.0, 391.0, 299.0, 364.0, -283.0, 562.5, -42.0, 214.0, -389.0, 42.5, 259.5, -302.5, 330.5, -338.0, 508.5, 319.5, -356.5, 421.5, 543.0]
m, s, w, delta, loc, scale = norm2.fit(data, fdelta=900, floc=0, fscale=1)
print m, s, w, delta, loc, scale
>>> 1.0 1.0 1.0 900 0 1
Upvotes: 3
Views: 5030
Reputation: 114781
I was able to get your distribution to fit the data after making a couple tweaks:
w
as you did, you have an implied constraint that 0 <= w
<= 1. The solver used by the fit()
method isn't aware of this constraint, so w
could end up with unreasonable values. One way to handle this type of constraint is to allow w
to be an arbitrary real value, but in the formula for the PDF, convert w
to a fraction phi
between 0 and 1 using phi = 0.5 + arctan(w)/pi
.fit()
method uses a numerical optimization routine to find the maximum likelihood estimate. Like most such routines, it needs a starting point for the optimization. The default starting point is all 1s, but this doesn't always work well. You can choose a different starting point by providing the values as positional arguments to fit()
after the data. The values that I used in the script worked; I didn't explore how sensitive the results were to these starting values.I did two estimations. In the first, I let delta
be a free parameter, and in the second, I fixed delta
at 900.
The script below generates the following plot:
Here's the script:
from __future__ import division
from scipy.stats import rv_continuous
import numpy as np
import matplotlib.pyplot as plt
class norm2_gen(rv_continuous):
def _argcheck(self, *args):
return True
def _pdf(self, x, m, s, w, delta):
phi = 0.5 + np.arctan(w)/np.pi
return np.exp(-(x-m+delta/2)**2 / (2. * s**2)) / np.sqrt(2. * np.pi * s**2) * phi + \
np.exp(-(x-m-delta/2)**2 / (2. * s**2)) / np.sqrt(2. * np.pi * s**2) * (1 - phi)
norm2 = norm2_gen(name='norm2')
data = [487.0, -325.5, -159.0, 326.5, 538.0, 552.0, 563.0, -156.0, 545.5,
341.0, 530.0, -156.0, 473.0, 328.0, -319.5, -287.0, -294.5, 153.5,
-512.0, 386.0, -129.0, -432.5, -382.0, -346.5, 349.0, 391.0, 299.0,
364.0, -283.0, 562.5, -42.0, 214.0, -389.0, 42.5, 259.5, -302.5,
330.5, -338.0, 508.5, 319.5, -356.5, 421.5, 543.0]
# In the fit method, the positional arguments after data are the initial
# guesses that are passed to the optimization routine that computes the MLE.
# First let's see what we get if delta is not fixed.
m, s, w, delta, loc, scale = norm2.fit(data, 1.0, 1.0, 0.0, 900.0, floc=0, fscale=1)
# Fit the disribution with delta fixed.
fdelta = 900
m1, s1, w1, delta1, loc, scale = norm2.fit(data, 1.0, 1.0, 0.0, fdelta=fdelta, floc=0, fscale=1)
plt.hist(data, bins=12, normed=True, color='c', alpha=0.65)
q = np.linspace(-800, 800, 1000)
p = norm2.pdf(q, m, s, w, delta)
p1 = norm2.pdf(q, m1, s1, w1, fdelta)
plt.plot(q, p, 'k', linewidth=2.5, label='delta=%6.2f (fit)' % delta)
plt.plot(q, p1, 'k--', linewidth=2.5, label='delta=%6.2f (fixed)' % fdelta)
plt.legend(loc='best')
plt.show()
Upvotes: 6