Gaussian fit to histogram on python seems off. What could I change to improve the fit?

Question

I have created a Gaussian fit to data plotted as a bar chart. However, the fit does not look right, and I don't know what to change to improve the fit. My code is as follows:

import matplotlib.pyplot as plt
import math
import numpy as np
from collections import Counter
import collections
from scipy.optimize import curve_fit
from scipy.stats import norm
from scipy import stats
import matplotlib.mlab as mlab

k_list = [-40, -32, -30, -28, -26, -24, -22, -20, -18, -16, -14, -12, -10, -8, -6, -4, -3, -2, 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34]
v_list = [1, 2, 11, 18, 65, 122, 291, 584, 1113, 2021, 3335, 5198, 7407, 10043, 12552, 14949, 1, 16599, 16770, 16728, 14772, 12475, 9932, 7186, 4987, 3286, 1950, 1080, 546, 285, 130, 54, 18, 11, 2, 2]

def func(x, A, beta, B, mu, sigma):
    return (A * np.exp(-x/beta) + B * np.exp(-100.0 * (x - mu)**2 / (2 * sigma**2))) #Normal distribution

popt, pcov = curve_fit(func, xdata=k_list, ydata=v_list, p0=[10000, 5, 10000, 10, 10])
print(popt)


x = np.linspace(-50, 50, 1000)

plt.bar(k_list, v_list, label='myPLOT', color = 'b', width = 0.75)
plt.plot(x, func(x, *popt), color='darkorange', linewidth=2.5, label=r'Fitted function')

plt.xlim((-30, 45))
plt.legend()
plt.show()

The plot I obtain is as follows:

Bar plot with fitted Gaussian curve

How can I adjust my fit?

Mad Physicist · Accepted Answer

You have a significant outlier here, possibly caused by a typo: (k, v) == (-3, 1) at index 16 in the data.

The representation of the data as a bar chart is not optimal here. The issue would be clearly visible if you showed the data in the same format as you show the fit. Either of the following would work better:

The outlier forces the peak down. Here is the fit if we remove the outlier manually:

You can remove the outlier automatically by checking its individual residual against the RMSE of the entire fit:

popt, pcov = curve_fit(func, xdata=k_list, ydata=v_list, p0=[10000, 5, 10000, 10, 10])
resid = np.abs(func(k_list, *popt) - v_list)
rmse = np.std(resid)
keep = resid < 3 * rmse
if keep.sum() < keep.size:
     popt, pcov = curve_fit(func, xdata=k_list[keep], ydata=v_list[keep], p0=popt)

Or even a repeated application:

popt = [10000, 5, 10000, 10, 10]
while True:
    popt, pcov = curve_fit(func, xdata=k_list, ydata=v_list, p0=popt)
    resid = np.abs(func(k_list, *popt) - v_list)
    rmse = np.std(resid)
    keep = resid < 5 * rmse
    if keep.sum() == keep.size:
        break
    k_list = k_list[keep]
    v_list = v_list[keep]

A 3-sigma outlier will trim everything off your data after a couple of iterations, so I used 5-sigma. Keep in mind that this is a very quick and dirty way to denoise data. It's really basically manual, since you have to re-check the data to make sure that your choice of factor was correct.

Gaussian fit to histogram on python seems off. What could I change to improve the fit?

Answers (1)

Related Questions