Reputation: 557
I have a set of experimental values and I want to find the function that best describes their distribution. While tinkering with some candidate functions, I discovered that scipy.optimize.curve_fit and scipy.stats.rv_continuous.fit give very different results, usually not in favor of the latter. Here is a simple example:
#!/usr/bin/env python3
import numpy as np
from scipy.optimize import curve_fit as fit
from scipy.stats import gumbel_r, norm
import matplotlib.pyplot as plt
amps = np.loadtxt("pyr_11.txt")*-1000 # http://pastebin.com/raw.php?i=uPK31JGE
# Maximum-likelihood fits on the raw samples
argsGumbel0 = gumbel_r.fit(amps)
argsGauss0 = norm.fit(amps)
# Normalized histogram (density=True replaces the removed normed=True)
bins = np.arange(60)
probs, binedges = np.histogram(amps, bins=bins, density=True)
bincenters = 0.5*(binedges[1:]+binedges[:-1])
# Least-squares fits of each pdf to the histogram, started from the ML estimates
argsGumbel1 = fit(gumbel_r.pdf, bincenters, probs, p0=argsGumbel0)[0]
argsGauss1 = fit(norm.pdf, bincenters, probs, p0=argsGauss0)[0]
# Plot the histogram with all four fitted curves
plt.figure()
plt.hist(amps, bins=bins, density=True, color='0.5')
xes = np.arange(0, 60, 0.1)
plt.plot(xes, gumbel_r.pdf(xes, *argsGumbel0), linewidth=2, label='Gumbel, maximum likelihood')
plt.plot(xes, gumbel_r.pdf(xes, *argsGumbel1), linewidth=2, label='Gumbel, least squares')
plt.plot(xes, norm.pdf(xes, *argsGauss0), linewidth=2, label='Gauss, maximum likelihood')
plt.plot(xes, norm.pdf(xes, *argsGauss1), linewidth=2, label='Gauss, least squares')
plt.legend(loc='upper right')
plt.show()
The difference in performance varies from dramatic to mild, but in my case it is always present. Why is that so? How do I choose the most appropriate optimisation method for a given case?
Upvotes: 2
Views: 1952
Reputation:
Don't take this entirely as an answer; I am posting it here only because I don't have enough reputation to comment. The poor performance is not because scipy does anything wrong, but because the model itself does not represent the data. In this case maximum likelihood is driven predominantly by the mean, while least squares tries to stay close to the histogram curve. That is why the Gaussian maximum-likelihood fit performs badly: it does not follow the full shape of the data, only a few summary properties of the distribution (for a Gaussian, the sample mean and standard deviation).
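To illustrate the point about the mean, here is a minimal sketch on synthetic data. The right-skewed Gumbel sample, its parameters, the bin range, and the helper `gauss_pdf` are all assumptions made for illustration; this is not your data set. It checks that `norm.fit` returns the sample mean and standard deviation, and compares that with a least-squares fit of the same Gaussian to the normalized histogram.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

# Synthetic right-skewed sample (assumed parameters, not the asker's data).
rng = np.random.default_rng(0)
sample = rng.gumbel(loc=20.0, scale=5.0, size=5000)

# Maximum likelihood for a Gaussian: the estimates are the sample mean
# and the (ddof=0) sample standard deviation.
mu_ml, sigma_ml = norm.fit(sample)

# Least squares: fit the Gaussian pdf to the normalized histogram instead.
bins = np.arange(0, 81)
probs, edges = np.histogram(sample, bins=bins, density=True)
centers = 0.5 * (edges[1:] + edges[:-1])

def gauss_pdf(x, mu, sigma):
    """Hypothetical helper: Gaussian pdf with explicit parameter names."""
    return norm.pdf(x, loc=mu, scale=sigma)

(mu_ls, sigma_ls), _ = curve_fit(gauss_pdf, centers, probs, p0=(mu_ml, sigma_ml))

print("sample mean/std :", sample.mean(), sample.std())
print("ML estimate     :", mu_ml, sigma_ml)
print("LS estimate     :", mu_ls, sigma_ls)
```

On a skewed sample like this one, the long tail pulls the mean (and therefore the maximum-likelihood Gaussian) away from the peak, while the least-squares curve stays closer to the bulk of the histogram. Neither Gaussian is a good model here; the two methods simply fail in different ways.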
For your problem I would recommend using a Landau distribution for fitting.
Upvotes: 1