MWH

Reputation: 363

How to fix the poor fitting of 1-D data?

I have a 1-D data set with only one independent column. I would like to fit a model to it so that I can sample from that model. The raw data: Data set

I tried various theoretical distributions from the Fitter package (https://pypi.org/project/fitter/), but none of them fit well. Then I tried kernel density estimation using sklearn. It fits well, but I could not prevent negative values in the samples because of the way it works. Finally, I tried a log-normal distribution, but the fit is not great either.
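For reference, one common way to keep KDE samples positive is to fit the KDE to the log of the data and exponentiate the draws. The following is only a minimal sketch under stated assumptions: it uses `scipy.stats.gaussian_kde` rather than sklearn, the array `c` is synthetic stand-in data (not the spreadsheet), and `sample_positive` is a hypothetical helper name.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Hypothetical stand-in for df["c"]: non-negative, right-skewed values.
c = rng.gamma(shape=2.0, scale=10.0, size=3915).round()

eps = 0.1  # same additional term as in the question
kde = gaussian_kde(np.log(c + eps))  # KDE fitted on the log scale


def sample_positive(n, seed=None):
    """Draw n values from the log-scale KDE and map them back with exp."""
    log_draws = kde.resample(n, seed=seed)[0]  # shape (1, n) -> (n,)
    return np.exp(log_draws)  # on the c + eps scale, strictly positive


samples = sample_positive(3915, seed=1)
```

Because `exp` is always positive, the draws cannot go below zero. They are on the `c + eps` scale, the same scale the log-normal code works on.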

Code for the log-normal fit:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
import math
from sklearn.metrics import  r2_score,mean_absolute_error,mean_squared_error

NN = 3915 # sample same number as original data set

df = pd.read_excel (r'Data_sets2.xlsx',sheet_name="Set1")

eps = 0.1  # Additional term for c

"""
    Estimate parameters of log(c) as normal distribution
"""
df["c"] = df["c"] + eps
mu = np.mean(np.log(df["c"]))
s  = np.std(np.log(df["c"]))
print("Mean:",mu,"std:",s)


def simulate(N):
    c = []
    for i in range(N):
        c_s = np.exp(np.random.normal(loc = mu, scale = s, size=1)[0])
        c.append(round(c_s))
    return (c)


predicted_c = simulate(NN)


XX = np.arange(NN)  # scipy.arange was removed from SciPy; use the NumPy function
### plot C relation ###
plt.scatter(XX,df["c"],color='g',label="Original data")
plt.scatter(XX,predicted_c,color='r',label="Sample data")
plt.xlabel('Index')
plt.ylabel('c')
plt.legend()
plt.show()

[Figure: original vs samples]

[Figure: fitted data]

What I am looking for is how to improve the fit. Any suggestions or pointers to models that may fit my data more accurately are appreciated. Thanks.

Upvotes: 1

Views: 219

Answers (1)

James Phillips

Reputation: 4657

Here is a graphical Python fitter that fits the scipy Double Gamma distribution (dgamma) to your spreadsheet data. I hope it is of some use, as a normal distribution seems to be a poor fit to this data set. The scipy documentation for dgamma is at https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.dgamma.html. Incidentally, the double Weibull distribution fit almost as well.

[Figure: plot of the fit]

import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_excel (r'Data_sets2.xlsx',sheet_name="Set1")
eps = 0.1  # Additional term for c

data = df["c"] + eps

P = ss.dgamma.fit(data)
rX = np.linspace(min(data), max(data), 50)
rP = ss.dgamma.pdf(rX, *P)

plt.hist(data, bins=25, density=True, color='slategrey')  # 'normed' was removed from matplotlib; 'density' replaces it

plt.plot(rX, rP, color='darkturquoise')
plt.show()
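Since the goal is to sample from the fitted model, the fit can be extended along the following lines. This is a rough sketch, not a definitive recipe: `data` here is synthetic stand-in data rather than the spreadsheet, the Kolmogorov-Smirnov statistic is used only to rank the two candidates (its p-value is not reliable when the parameters were estimated from the same data), and dropping negative draws is an assumption about what is wanted, since dgamma and dweibull are defined on the whole real line.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(0)
# Hypothetical stand-in for df["c"] + eps from the spreadsheet.
data = rng.gamma(shape=2.0, scale=10.0, size=1000) + 0.1

# Fit each candidate and record its KS statistic (smaller is closer).
results = {}
for name in ("dgamma", "dweibull"):
    dist = getattr(ss, name)
    params = dist.fit(data)
    ks = ss.kstest(data, name, args=params)
    results[name] = (params, ks.statistic)

best = min(results, key=lambda k: results[k][1])
best_params = results[best][0]

# Draw from the better-ranked fit, then discard any negative draws.
draws = getattr(ss, best).rvs(*best_params, size=len(data), random_state=rng)
samples = draws[draws >= 0]
```

Filtering leaves slightly fewer than `len(data)` samples when some draws are negative. Drawing a few extra and truncating is one way around that.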

Upvotes: 1
