Reputation: 29
I wish to create a continuous probability distribution from this dataset.
The 'Value' shows a measured value and the 'Weight' is the probability of measuring this value in this measurement.
I have already plotted the data, with the value on the x-axis and the probability on the y-axis, but I would like to fit an explicit distribution to it.
In my data analysis I eventually want to compare several such distributions by their parameters. I hope you can help me out.
Line # | Value | Weight |
---|---|---|
0 | 0.0538502 | 0.016508 |
1 | 0.0184823 | 0.0298487 |
2 | 0.0647929 | 0.0122637 |
3 | 0.0262852 | 0.0234716 |
4 | 0.0447611 | 0.0197072 |
5 | 0.0643164 | 0.0165399 |
6 | 0.0709176 | 0.0143751 |
7 | 0.0871276 | 0.012253 |
8 | 0.0341064 | 0.0197392 |
9 | 0.0593696 | 0.0143858 |
10 | 0.0436119 | 0.0202617 |
11 | 0.0505131 | 0.0191846 |
12 | 0.0378706 | 0.0207842 |
13 | 0.0298233 | 0.0250712 |
14 | 0.157727 | 0.0111866 |
15 | 0.0556603 | 0.0186408 |
16 | 0.0542849 | 0.017617 |
17 | 0.0395772 | 0.0180969 |
18 | 0.0694962 | 0.0117305 |
19 | 0.0343318 | 0.0229277 |
20 | 0.139291 | 0.00907511 |
22 | 0.0232517 | 0.0186514 |
23 | 0.207768 | 0.0069423 |
24 | 0.0156452 | 0.021872 |
25 | 0.117749 | 0.0100989 |
26 | 0.124017 | 0.0111973 |
27 | 0.0679313 | 0.0133407 |
28 | 0.0733413 | 0.0117198 |
29 | 0.100553 | 0.0133407 |
30 | 0.0695865 | 0.016508 |
31 | 0.117732 | 0.0138633 |
32 | 0.0540577 | 0.0170518 |
33 | 0.0736274 | 0.0170625 |
34 | 0.0332381 | 0.0293155 |
35 | 0.0803423 | 0.0159961 |
36 | 0.0465 | 0.0191846 |
37 | 0.0889299 | 0.0159854 |
38 | 0.053232 | 0.020251 |
39 | 0.131361 | 0.0122637 |
40 | 0.0233194 | 0.0240048 |
41 | 0.830735 | 0.0053107 |
42 | 0.341012 | 0.0069423 |
43 | 0.101263 | 0.0106534 |
44 | 0.127061 | 0.00959765 |
45 | 0.13706 | 0.0122637 |
46 | 0.120035 | 0.0106641 |
47 | 0.0801194 | 0.0138526 |
48 | 0.0617996 | 0.0165186 |
49 | 0.197555 | 0.0117305 |
50 | 0.0810635 | 0.0133301 |
51 | 0.0178539 | 0.0335811 |
52 | 0.0391433 | 0.0170518 |
53 | 0.0663863 | 0.0133194 |
54 | 0.0617675 | 0.0170625 |
55 | 0.00684359 | 0.0346582 |
56 | 0.0642299 | 0.0133301 |
57 | 0.00970105 | 0.0239941 |
58 | 0.0307687 | 0.0213068 |
59 | 0.0160796 | 0.0255937 |
60 | 0.0147901 | 0.0266388 |
61 | 0.073745 | 0.0122637 |
62 | 0.0420728 | 0.0207949 |
63 | 0.0211625 | 0.0207949 |
66 | 0.0241562 | 0.0255937 |
67 | 0.0329688 | 0.0239834 |
68 | 0.0739628 | 0.0181289 |
69 | 0.0149927 | 0.0266388 |
70 | 0.0130271 | 0.0378467 |
73 | 0.0107957 | 0.0351914 |
74 | 0.040447 | 0.0175744 |
75 | 0.00123215 | 0.0559756 |
76 | 0.0134575 | 0.0309151 |
77 | 0.00592594 | 0.0453116 |
Upvotes: 1
Views: 1214
Reputation: 26201
It looks like the data you have is a sort of (non-normalized) histogram.
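A quick way to check the normalization is to sum the weights. The sketch below uses only the first five rows of the table for brevity, so its total is not the total for the full dataset; it just shows how to rescale weights so they add up to 1. The variable names are illustrative.

```python
import numpy as np

# First five rows of the table above (a subset, for illustration only).
values = np.array([0.0538502, 0.0184823, 0.0647929, 0.0262852, 0.0447611])
weights = np.array([0.016508, 0.0298487, 0.0122637, 0.0234716, 0.0197072])

total = weights.sum()            # equals 1 only if the weights are already normalized
normalized = weights / total     # rescaled so that the weights sum to 1

print(f"sum of weights (subset): {total:.6f}")
print(f"sum after normalizing:   {normalized.sum():.6f}")
```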
The first task is of course to plot it:
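import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# `df` is assumed to hold the table from the question, with columns 'Value' and 'Weight'
df = df.sort_values('Value')
plt.plot(df['Value'], df['Weight'])
plt.xlabel('value')
plt.ylabel('weight')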
At first glance, it could indicate an exponential or a power-law distribution, but let's see.
Let's first try to smooth out that curve:
import statsmodels.api as sm
x, w = df['Value'].values, df['Weight'].values
s = pd.DataFrame(sm.nonparametric.lowess(w, x, frac=0.2), columns=['x', 'w']).set_index('x').squeeze()
s = s.reindex(np.linspace(x.min(), x.max(), 200), method='ffill', limit=1).interpolate()
s.plot()
plt.plot(x, w, '.')
That gives an okay-ish fit:
We can then use that to generate a fake, crude "sample" following that smooth pdf:
sample = np.random.choice(s.index, p=s/s.sum(), size=1000)
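Drawing with `np.random.choice` only ever returns the grid points themselves. If values between grid points are wanted, one possible alternative is inverse-transform sampling on the tabulated curve. The sketch below is not part of the original approach: the grid and the `density` shape are hypothetical stand-ins for the smoothed series `s`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical smoothed curve on a regular grid (stand-in for `s` above).
grid = np.linspace(0.001, 0.83, 200)
density = grid ** -0.5                      # illustrative shape only

# Build a CDF with the trapezoidal rule, then normalize it to end at 1.
steps = np.diff(grid) * 0.5 * (density[1:] + density[:-1])
cdf = np.concatenate([[0.0], np.cumsum(steps)])
cdf /= cdf[-1]

# Inverse-transform sampling: map uniform draws through the inverse CDF.
u = rng.random(1000)
sample = np.interp(u, cdf, grid)
```

Because the CDF is linearly interpolated, the draws are continuous over the grid range rather than restricted to the 200 grid points.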
At that point, you can make QQ plots with various distributions following your intuition, and select one that seems to fit well:
from scipy import stats as distns  # public API, instead of the private scipy.stats._continuous_distns module
# trying a normal (the default)
sm.qqplot(sample, line='q')
plt.title('Normal')
Clearly not a good fit at all (but we knew that from a first glance at the data):
# trying an exponential
sm.qqplot(sample, distns.expon, line='q')
plt.title('Exponential')
Not very good either:
Perhaps a power-law would fit better?
# we are only interested in the parameter a, so we are
# not going to let loc and scale be fitted;
# instead, we will freeze them at loc=0, scale=1
a, loc, scale = distns.powerlaw.fit(sample, floc=0, fscale=1)
# then, we do the QQ plot with the fitted parameter a
sm.qqplot(sample, distns.powerlaw, distargs=(a,), line='q')
plt.title(f'Power law with a={a}')
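QQ plots are judged by eye; a number can help when two candidates look similar. One option, sketched here as an addition to the approach above, is the Kolmogorov-Smirnov distance between the sample and each fitted CDF (smaller means closer). The data below are synthetic, drawn from a power law with a known exponent, and are not the data from the question. Since the parameters are fitted on the same sample, the statistic is best read as a relative measure rather than a formal test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data with a known exponent (a = 0.5), not the data from the question.
sample = stats.powerlaw.rvs(0.5, size=2000, random_state=rng)

# Fit each candidate with the location fixed at 0, as in the answer.
a_hat, _, _ = stats.powerlaw.fit(sample, floc=0, fscale=1)
_, expon_scale = stats.expon.fit(sample, floc=0)

# Kolmogorov-Smirnov distance between the sample and each fitted CDF.
ks_powerlaw = stats.kstest(sample, stats.powerlaw(a_hat).cdf).statistic
ks_expon = stats.kstest(sample, stats.expon(scale=expon_scale).cdf).statistic

print(f"a_hat = {a_hat:.3f}")
print(f"KS powerlaw = {ks_powerlaw:.3f}, KS expon = {ks_expon:.3f}")
```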
Corresponding distribution and how to use it
You can now instantiate a distribution following what was found (type and parameters), draw random variates from it, and also plot the pdf directly for comparison purposes with the original data:
g = distns.powerlaw(a=a)
# new points drawn according to g
v = g.rvs(size=100000)
plt.hist(v, bins=100, density=True, histtype='step');
Direct pdf plot and comparison with the original data:
y = g.pdf(x)
plt.plot(x, y/y.sum())
plt.plot(x, w/w.sum(), '.')
plt.title('Normalized pdf and original sample data')
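Since the goal in the question is to compare distributions by their parameters, it may help to note that a frozen distribution also exposes summary quantities directly. A minimal sketch, assuming a hypothetical exponent `a = 0.6` rather than the fitted value:

```python
from scipy import stats

a = 0.6                      # hypothetical exponent, not the fitted value
g = stats.powerlaw(a)        # frozen distribution with loc=0, scale=1

mean = g.mean()              # analytically a / (a + 1) for this distribution
median = g.ppf(0.5)          # inverse CDF at 0.5, i.e. 0.5 ** (1 / a)
prob_below = g.cdf(0.1)      # P(X <= 0.1), i.e. 0.1 ** a

print(f"mean = {mean:.4f}, median = {median:.4f}, P(X <= 0.1) = {prob_below:.4f}")
```

Two datasets fitted this way can then be compared through `a` itself or through any of these derived quantities.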
Last word
So, where to go from here? You should look in depth into that distribution and its physical meaning, and see if that makes sense in the context of your experimental setup.
Upvotes: 3