Vilmos Foltenyi
Vilmos Foltenyi

Reputation: 21

Fitting curve using real Gauss distribution in Python

I have a set of points, their scattered image resemblances to a Gaussian normal distribution. Searching the internet there are many Python sample how to fit a curve to the points. They based on:

def Gauss1(X, C, mu, sigma):
    return C * np.exp(-(X-mu) ** 2 / (2 * sigma ** 2))

and

from scipy.optimize import curve_fit

This produce a very well fit curve. The problem is that Gauss1 is not the Gaussian normal distribution, it should be:

def Gauss2(X, mu, sigma):
    return (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-(X-mu) ** 2 / (2 * sigma ** 2))

Using Gauss2 produces the next message and there is no popt:

OptimizeWarning: Covariance of the parameters could not be estimated

Is any way to generate the fitting curve using the real Gaussian distribution?

popt2, pcov2 = curve_fit(Gauss2, x, y)

OptimizeWarning: Covariance of the parameters could not be estimated

I think the points are not needed, but I can publish 25 lines.

import matplotlib.pyplot as plt # Follow the convention
import numpy as np              # Library for working with arrays
from scipy.optimize import curve_fit

######################################## G L O B A L S
# X the ratings after grouping by 23 the chess players from FIDE 2024 Oct Standard
# Y is the average of players in the averaged ratings
XY = (
    (1411, 231), (1434, 271), (1457, 281), (1480, 287), 
    (1503, 292), (1526, 298), (1549, 293), (1572, 299), 
    (1595, 300), (1618, 303), (1641, 304), (1664, 308), 
    (1687, 301), (1710, 311), (1733, 304), (1756, 291), 
    (1779, 286), (1802, 283), (1825, 279), (1848, 268), 
    (1871, 260), (1894, 243), (1917, 229), (1940, 221), 
    (1963, 203), (1986, 169), (2009, 134), (2032, 112), 
    (2055, 102), (2078,  94), (2101,  81), (2124,  76), 
    (2147,  70), (2170,  61), (2193,  54), (2216,  45), 
    (2239,  42), (2262,  35), (2285,  31), (2308,  27), 
    (2331,  26), (2354,  22), (2377,  20), (2400,  18), 
    (2423,  15), (2446,  10), (2469,  10), (2492,   7), 
    (2515,   6), (2538,   5), (2561,   4), (2584,   3), 
    (2607,   3), (2630,   2), (2653,   2), (2676,   2), 
    (2699,   1), (2722,   2), (2745,   1), (2768,   1), 
    (2791,   1), (2837,   1))

class c:  # Constants
    X_LABEL_STEP = 50
    Y_LABEL_STEP = 10
    A4 = (11.69, 8.27)  # W x H in inches

###################################### F U N C T I O N S

def Graph_Gauss():
    global c, XY

    X = [] ; Y = []
    for e in XY:
        X += [e[0]]
        Y += [e[1]]

    # Set up the limits for X, ratings
    minX = min(X) ; maxX = max(X)
    # Set up the limits for Y, number of players with that rating
    minY = min(Y) ; maxY = max(Y)
        
    fig, ax = plt.subplots()
    fig.set_size_inches(c.A4)
    fig.suptitle("Rating Distribution", fontsize=18, y=0.935)

    x1  = []  # Make the x axle
    xlb = []  # Make the labels for x
    stp = c.X_LABEL_STEP
    for k in range(stp*(minX//stp), stp*(maxX//stp + 2), stp):
        x1 += [k]
        xlb.append(f"{k}")
    ax.set_xticks(ticks=x1, labels=xlb, rotation=270)
    ax.set_xlabel("Rating", fontsize=15, labelpad=7)

    stp = c.Y_LABEL_STEP
    yticks = np.arange(stp*(minY//stp), stp*(maxY//stp + 2), stp)
    ax.set_ylabel("Number of Players", fontsize=15, rotation=270, labelpad=18)
    ax.set_yticks(ticks=yticks)
    ax.grid(which="major", axis="both")

    ax.scatter(X, Y, color="#000FFF", marker='o', s=14)

    # plt.show()

    # Fit a normal distribution, aka Gaussian fitting curve
    # https://www.wasyresearch.com/tutorial-python-for-fitting-gaussian-distribution-on-data/
    # mu = mean = sum(x) / len(x) ; In our case sum(x * y) / sum(y)
    # sigma = standard deviation = sqrt((sum((x - mean)**2) / len(x))
    #
    # https://www.scribbr.com/statistics/normal-distribution/
    # 1/(sigma*sqrt(2*pi)) * e**(-(x - mean)**2 / (2 * sigma**2))
    # Calculating the Gaussian PDF values given Gaussian parameters and random variable X
    def Gauss1(X, C, mu, sigma):
        return C * np.exp(-(X-mu)**2 / (2 * sigma**2))
    def Gauss2(X, mu, sigma):
        return (1/(sigma*np.sqrt(2*np.pi))) * np.exp(-(X-mu)**2 / (2 * sigma**2))

    x = np.array(X)
    y = np.array(Y)
    mu    = sum(x * y) / sum(y)                  
    sigma = np.sqrt(sum(y*(x - mu)**2)/sum(y))

    print(f"{1/(sigma*np.sqrt(2*np.pi))=:.2f} {mu=:.2f} {sigma=:.2f}")

    popt1, pcov1 = curve_fit(Gauss1, x, y, p0=[max(y), mu, sigma], maxfev=5000)
    popt2, pcov2 = curve_fit(Gauss2, x, y)  # This generates:
    # OptimizeWarning: Covariance of the parameters could not be estimated

    yg1 = Gauss1(x, *popt1)
    yg2 = Gauss2(x, *popt2)
    
    ax.plot(x, yg1, color="#FF0F00", linewidth=3,
            label=f"Normal: mu={popt1[1]:.2f}, sigma={popt1[2]:.2f}")
    plt.legend(fontsize=12)

    breakpoint()  # ???? DEBUG

    plt.show()
    pass  # To set a breakpoint

#######################################################################
if __name__ == '__main__':
    # breakpoint()  # ???? DEBUG, to set other breakpoints
    Graph_Gauss()

Added on Oct. 28, 2024 Hi Aldehyde and Lastchance, Thank you very much for your answers, it is getting clearer. Now I have two questions: • What is exactly the Gaussian distribution function, where is the sigma; I couldn’t figure out from the many internet search hits. • For using the Gauss2() you normalized the y values by 230,000 and this produced a well-fitting curve. I tried a few other numbers, 240,000 even gives a better fit. Where this or a similar number comes from. I tried np.sum(y)=8,241 and the area below the dots np.sum((x[1:]-x[:-1]) * ((y[1:]+y[:-1])/2)) = 186,898, they gave a bad fit. So what is the right method, how can 240,000 or something similar number be deducted from the given 62 points. If you can recommend a text or some other book, where I can find the answer, it is also fine. Thanks, again, Vilmos

Upvotes: 2

Views: 112

Answers (1)

lastchance
lastchance

Reputation: 6640

You have two major errors.

  1. The data that you provide are BINNED FREQUENCIES, not PROBABILITY DENSITIES. However, the function Gauss2 that you are trying to fit is a probability density function (whose integral is 1).

  2. You haven't provided all the data, so it's quite difficult to convert from a binned frequency to a probability density because you don't know the total number of people! (I've estimated it below.)

The connection between a binned frequency and a pdf is

binned frequency = number x pdf x dx

where number is the total number of people and dx is the bin width. Thus, you can fit your Gauss2 function to Y/(number*dx) rather than Y itself. When you come to plot it then you have to reverse that to get the frequencies.

I have had to estimate number=10000 (by assuming that it's about "twice the number after the maximum") because you didn't provide all the data; you can correct this in the code below from your full database.

Change your lines

popt1, pcov1 = curve_fit(Gauss1, x, y, p0=[max(y), mu, sigma], maxfev=5000)
popt2, pcov2 = curve_fit(Gauss2, x, y)
yg1 = Gauss1(x, *popt1)
yg2 = Gauss2(x, *popt2)
ax.plot(x, yg1, color="#FF0F00", linewidth=3,
        label=f"Normal: mu={popt1[1]:.2f}, sigma={popt1[2]:.2f}")

to the following (correcting number from the entire dataset):

popt1, pcov1 = curve_fit(Gauss1, x, y, p0=[max(y), mu, sigma], maxfev=5000)
yg1 = Gauss1(x, *popt1)

number = 10000   # this has been estimated: you didn't provide all of the data
dx     = 23      # bin width
popt2, pcov2 = curve_fit(Gauss2, x, y/(number*dx), p0=[mu, sigma] )
yg2 = Gauss2(x, *popt2) * number * dx

ax.plot(x, yg1, color="#FF0000", linewidth=3,
        label=f"Normal: mu={popt1[1]:.2f}, sigma={popt1[2]:.2f}")
ax.plot(x, yg2, color="#00FF00", linewidth=3,
        label=f"Normal: mu={popt2[0]:.2f}, sigma={popt2[1]:.2f}")

enter image description here

Upvotes: 1

Related Questions