Reputation: 11
I have data, only discrete data points, and I know the form of its distribution, i.e.:
y = w * gamma.pdf(x, alpha1, scale=scale1) + (1-w) * gamma.pdf(x, alpha2, scale=scale2)
How do you accurately infer these five parameters? My code is as follows:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import gaussian_kde,gamma
data1 = np.random.gamma(4, 1, 200)
data2 = np.random.gamma(6, 2, 200)
data = np.concatenate((data1, data2))
x = np.linspace(0, np.max(data) + 2, 800)
y = gaussian_kde(data)(x)
def two_gamma(x, w, alpha1, scale1, alpha2, scale2):
    return w * gamma.pdf(x, alpha1, scale=scale1) + (1 - w) * gamma.pdf(x, alpha2, scale=scale2)
initial_params = [0.5, 2, 1, 2, 1]
params, params_covariance = curve_fit(two_gamma, x, y, p0=initial_params, maxfev=50000)
w, alpha1, scale1, alpha2, scale2 = params
plt.figure(figsize=(10, 6))
sns.histplot(data, bins=20, kde=False, color='y', label='Data density', alpha=0.5, stat='probability')
plt.plot(x, y, marker='o', linestyle='', markersize=1, label='Data distribution')
y_fit = w * gamma.pdf(x, alpha1, scale=scale1) + (1 - w) * gamma.pdf(x, alpha2, scale=scale2)
plt.plot(x, y_fit, 'r-', linewidth=1, alpha=0.7,label='Mixture gamma distribution')
plt.legend(fontsize=8, loc='upper right')
plt.title("Expression distribution of gamma mixture")
plt.xlabel("Expression")
Fitting with curve_fit this way does not reproduce the double peak, and I hope to approximate the parameters of the function precisely. Neural networks and other methods are also acceptable.
Upvotes: 1
Views: 47
Reputation: 11042
Let's state your model as follows:
import numpy as np
import pandas as pd
from scipy import stats, optimize
import matplotlib.pyplot as plt
def model(p, data):
    # Two-component gamma mixture, p = [w, alpha1, scale1, alpha2, scale2]
    return p[0] * stats.gamma.pdf(data, p[1], scale=p[2]) + (1 - p[0]) * stats.gamma.pdf(data, p[3], scale=p[4])
This is itself a PDF, since it is a weighted sum of PDFs whose weights sum to unity.
For the given set of parameters:
p = [0.5, 4, 1, 6, 2]
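As a quick sanity check (a minimal sketch, not part of the original answer), we can verify numerically that the mixture integrates to one:
from scipy import integrate

# The mixture PDF should integrate to ~1.0 over its support [0, inf)
area, _ = integrate.quad(lambda x: model(p, x), 0, np.inf)
print(area)  # ~1.0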
Assume we randomly sample from this distribution (a naive implementation, computationally heavy):
class BiGamma(stats.rv_continuous):
    def _pdf(self, x, w, a1, s1, a2, s2):
        return model([w, a1, s1, a2, s2], x)
bigamma = BiGamma(shapes='w, a1, s1, a2, s2')
law = bigamma(*p)
data = law.rvs(3000)
pd.DataFrame({"x": data}).to_csv("data.csv", index=False)
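As an aside (a sketch, not the method used above), sampling a mixture is much faster if we first draw a component label per sample and then draw from the selected gamma:
rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility
n = 3000
# True with probability w = p[0]: sample comes from the first component
labels = rng.random(n) < p[0]
data_fast = np.where(
    labels,
    rng.gamma(p[1], p[2], size=n),  # first component: shape 4, scale 1
    rng.gamma(p[3], p[4], size=n),  # second component: shape 6, scale 2
)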
Or use this sample for convenience:
data = pd.read_csv("https://pastebin.com/raw/1JpzfT2E")["x"].values
If the sampled data are IID, we can perform MLE (maximum likelihood estimation) to regress the parameters:
def likelihood(p, data):
    # Negative log-likelihood (log10 differs from ln only by a constant
    # factor, so it has the same minimizer)
    return - np.sum(np.log10(model(p, data)))
p0 = [1, 5, 1, 5, 1]
sol = optimize.minimize(likelihood, x0=p0, args=(data,), bounds=[(0, 1)] + [(0, np.inf)] * 4)
# message: CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
# success: True
# status: 0
# fun: 3783.3002631839786
# x: [ 5.316e-01 5.375e+00 2.137e+00 4.123e+00 9.111e-01]
# nit: 31
# jac: [-3.060e-02 4.502e-03 1.546e-02 -1.396e-02 -4.925e-02]
# nfev: 192
# njev: 32
# hess_inv: <5x5 LbfgsInvHessProduct with dtype=float64>
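A short sketch (not in the original answer) to unpack the estimate: the components come back in swapped order relative to p, which is expected since mixture components are interchangeable (label switching):
w, a1, s1, a2, s2 = sol.x
# (a1, s1) ~ (6, 2) and (a2, s2) ~ (4, 1): the true components in swapped
# order, which is fine because the mixture is invariant under relabeling.
print(f"w={w:.3f}, gamma1=({a1:.3f}, {s1:.3f}), gamma2=({a2:.3f}, {s2:.3f})")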
Notice that we need to choose the initial value carefully and provide bounds in order to make the algorithm converge.
If you only have binned data (a histogram), we can instead perform NLLS (non-linear least squares) directly on the binned data rather than on the samples:
freq, bins = np.histogram(data, density=True, bins=35)
centers = (bins[:-1] + bins[1:]) * 0.5
def model_proxy(x, w, a1, s1, a2, s2):
    # Flat-signature wrapper so curve_fit can pass parameters individually
    return model([w, a1, s1, a2, s2], x)
popt, pcov = optimize.curve_fit(model_proxy, centers, freq, bounds=[[0] * 5, [1] + [np.inf] * 4])
# (array([0.51763272, 5.44778038, 2.14335038, 3.82036594, 1.00628071]),
# array([[ 0.00066593, -0.01337379, 0.00424389, 0.00256379, -0.00140308],
# [-0.01337379, 0.29725577, -0.09720244, -0.04583584, 0.0263838 ],
# [ 0.00424389, -0.09720244, 0.03236382, 0.0140486 , -0.00816902],
# [ 0.00256379, -0.04583584, 0.0140486 , 0.01511191, -0.00692502],
# [-0.00140308, 0.0263838 , -0.00816902, -0.00692502, 0.00344378]]))
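As a side note (a minimal sketch, not part of the original answer), the diagonal of pcov provides variance estimates that translate into one-sigma uncertainties:
# Standard errors on [w, a1, s1, a2, s2] from the covariance diagonal
perr = np.sqrt(np.diag(pcov))
print(perr)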
As with MLE, NLLS needs an educated guess and bounds for the algorithm to converge.
Finally, we render the solutions against the data:
xlin = np.linspace(data.min(), data.max(), 200)
fig, axe = plt.subplots()
axe.hist(data, density=True, bins=35, alpha=0.75, label="Raw Data")
axe.plot(xlin, model(p, xlin), label="Model")
axe.plot(xlin, model(sol.x, xlin), label="MLE Fit")
axe.scatter(centers, freq, marker='.', label="Binned Data")
axe.plot(xlin, model_proxy(xlin, *popt), label="NLLS Fit")
axe.legend()
axe.grid()
Upvotes: 0