kezzos

Reputation: 3221

Scikit learn, fitting a gaussian to a histogram

In scikit-learn, fitting a Gaussian peak using GMM seems to work with discrete data points. Is there a way of using GMM with data that has already been binned, or aggregated into a histogram?

For example, the following code is a work-around which converts the binned data into discrete data points before fitting:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import mixture

def fit_one_peak(x, linspace):
    gmm = mixture.GMM(n_components=1)  # a GMM with a single component
    gmm.fit(x)  # fit the model to the discrete data points
    m1 = gmm.means_
    w1 = gmm.weights_
    # score_samples returns (log probabilities, responsibilities); take the first
    return np.exp(gmm.score_samples(linspace)[0]), m1[0][0], w1[0]

def convert_to_signal(d, s):
    # Repeat each x value by its bin count to recover discrete data points
    c = []
    for count, value in zip(d, s):
        c.extend([value] * int(count))  # fractional counts are truncated
    return c

d = [0.5, 2, 5, 3, 1, 0.5]  # y data, which is already binned
s = [0, 1, 2, 3, 4, 5]  # x data

signal = convert_to_signal(d, s)
linspace = np.linspace(s[0], s[-1], len(s))
l, mean, weight = fit_one_peak(signal, linspace)
l = l * (np.max(d) / np.max(l))  # Normalize the fitted y to the data's peak height

fig = plt.figure()
plt.plot(s, d, label='Original')
plt.plot(linspace, l, label='Fitted')
plt.hist(signal, label='Re-binned')
plt.legend()

Upvotes: 3

Views: 4771

Answers (1)

A_A

Reputation: 2368

Perhaps you are confusing the concept of optimising a statistical model from a set of data points with that of fitting a curve through a set of data points.

Some of the scikit-learn code cited above is trying to optimise a statistical model from a set of data points. In other words, in that case, you are trying to estimate the parameters of the probability distribution of a source that COULD HAVE generated the set of data points. For more information on this, you might want to go through the "Principles" section of this article. How this information is then presented to a viewer is a totally independent subject. For example, you can recover the parameters of your Gaussian (i.e. mean and standard deviation) from the data points and then overlay a Gaussian CURVE on top of your data histogram. For more information on this, please see this link.

When all you have is your histogram data (that is, the frequency of occurrence of each data item within your dataset), then you have pairs of data points of the form [(x0,y0), (x1,y1), (x2,y2), ..., (xn,yn)]. In this case, you are trying to fit a CURVE through these particular data points, which you can do with something like least squares. For more information about this, please see this, this and this link.

Therefore, to optimise your Gaussian probability density function from a data set, you can use the GMM model of sklearn and feed it your data set directly (that is, feed it the original data that your histogram was based on).
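A minimal sketch of this first route, assuming you have the raw samples. It uses the current GaussianMixture API (the class was called GMM in older scikit-learn releases, as in the question) and made-up sample data drawn from a known Gaussian:

```python
import numpy as np
from sklearn.mixture import GaussianMixture  # called GMM in older scikit-learn releases

# Hypothetical raw (un-binned) samples from a single Gaussian source
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=0.5, size=1000).reshape(-1, 1)  # sklearn expects 2-D

gmm = GaussianMixture(n_components=1)
gmm.fit(samples)

mean = gmm.means_[0][0]                       # estimated mean, near 2.0
std = np.sqrt(gmm.covariances_[0][0][0])      # estimated standard deviation, near 0.5
```

With the mean and standard deviation recovered, you can draw the corresponding Gaussian curve on top of a histogram of the samples.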

If you ALREADY have the data of the histogram, then you would be looking at functions such as curve_fit. Just a slight note here: since you are trying to fit a probability density function, your data (that is, the Y component of your HISTOGRAM data) would have to be normalised to have a sum of 1.0. To do this, simply divide each frequency count by the sum of all frequency counts.
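A short sketch of this second route with scipy.optimize.curve_fit, using the question's binned data. The gaussian helper and the initial guesses in p0 are my own additions, not part of the question:

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, mu, sigma, a):
    # An un-normalised Gaussian curve with amplitude a
    return a * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

# The binned data from the question: x positions and frequency counts
s = np.array([0, 1, 2, 3, 4, 5], dtype=float)
d = np.array([0.5, 2, 5, 3, 1, 0.5])

d_norm = d / d.sum()  # normalise so the frequencies sum to 1.0

# Initial guesses: centre of mass, a rough width, and the peak height
p0 = [np.sum(s * d_norm), 1.0, d_norm.max()]
(mu, sigma, a), _ = curve_fit(gaussian, s, d_norm, p0=p0)
```

The fitted mu lands near the histogram's peak at x = 2, and you can plot gaussian(s, mu, sigma, a) against d_norm to inspect the fit.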

For more information, you might want to check this, this and this link.

Hope this helps.

Upvotes: 5
