Reputation: 21
I would like to apply the best-fit CDF found by Fitter to each value in a number of pandas DataFrame columns, ideally by passing the Fitter results to scipy.stats (or another library).
I can get the distribution function easily enough from Fitter with the following code:
import numpy as np
import pandas as pd
import seaborn as sns
from fitter import Fitter
from fitter import get_common_distributions
from fitter import get_distributions
dataset = pd.read_csv("econ.csv")
dataset.head()
sns.set_style('white')
sns.set_context("paper", font_scale = 2)
sns.displot(data = dataset, x = "Value_1",kind = "hist", bins = 100, aspect = 1.5)
spac = dataset['Value_1'].values
f = Fitter(spac, distributions=get_distributions())
f.fit()
f.summary()
f.get_best(method='sumsquare_error')
This provides me with an output for Value_1:
{'norminvgauss': {'a': 1.87, 'b': -0.65, 'loc': 0.46, 'scale': 1.24}}
Now this is where I am stuck:
Is there a way to pass this information back to Scipy Stats (or another library) so I can calculate the cumulative distribution function (CDF) of the best fit for each value in each column?
The dataset columns range from Value_1 to Value_99, with about 400 rows. Once I know how to feed the Fitter results back into scipy.stats, I should be able to write a simple for loop to apply this to each column.
An example of the result would be like:
| ID | Value1 | CDF_BestFit_Value1 |
|---|---|---|
| n | 0.9 | 0.33 |
| n+1 | 0.7 | 0.07 |
Much appreciated in advance to anyone who is able to help with this.
Upvotes: 1
Views: 223
Reputation: 3908
Is there a way to pass this information back to Scipy Stats (or another library) so I can calculate the cumulative distribution function (CDF) of the best fit for each value in each column?
IIUC, the information you are trying to pass back to scipy.stats looks like:
{'norminvgauss': {'a': 1.87, 'b': -0.65, 'loc': 0.46, 'scale': 1.24}}
In this example, you would want to evaluate the CDF of the norminvgauss distribution with these parameters. The difficulty is that you want to do this programmatically, given the name of the distribution family as a string and the parameter name/value pairs as a dictionary.
One way to extract the name of the distribution family and the parameter info is like:
dist_name, params = list(d.items())[0]
(Perhaps there are more elegant ways, but this is the first that comes to mind. It's unusual to have just a single key in the dictionary, but I often iterate over dictionaries with for dist_name, params in d.items().)
Then, the way to get the distribution family object is:
dist_family = getattr(stats, dist_name)
I would pass the parameters to the distribution family object to create a "frozen" distribution with those parameters:
dist = dist_family(**params)
and then you can evaluate its cdf at whatever arguments you please:
dist.cdf([0, 0.5, 1])
For easier copy-pasting:
from scipy import stats
d = {'norminvgauss': {'a': 1.87, 'b': -0.65, 'loc': 0.46, 'scale': 1.24}}
dist_name, params = list(d.items())[0]
dist_family = getattr(stats, dist_name)
dist = dist_family(**params)
dist.cdf([0, 0.5, 1])
# array([0.45497638, 0.69433145, 0.87312848])
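If you want the steps above in one reusable piece, here is a minimal sketch. The helper name cdf_from_best is hypothetical (it is not part of fitter or scipy), and it assumes the dictionary has the same single-key shape as the get_best output shown in the question. The isinstance check is an extra guard I added so that a name that is not a continuous scipy.stats distribution fails early.

```python
import numpy as np
from scipy import stats


def cdf_from_best(best, x):
    """Evaluate the CDF described by a {name: {param: value}} dict at x.

    Hypothetical helper: `best` is assumed to look like the output of
    Fitter.get_best(), i.e. a dict with exactly one distribution name.
    """
    if len(best) != 1:
        raise ValueError("expected exactly one distribution in the dict")
    # next(iter(...)) pulls out the single (name, params) pair.
    dist_name, params = next(iter(best.items()))
    dist_family = getattr(stats, dist_name, None)
    # Guard against names that are not continuous distributions in scipy.stats.
    if not isinstance(dist_family, stats.rv_continuous):
        raise ValueError(f"{dist_name!r} is not a continuous scipy.stats distribution")
    # Passing the parameters by keyword creates the frozen distribution.
    return dist_family(**params).cdf(np.asarray(x, dtype=float))


best = {'norminvgauss': {'a': 1.87, 'b': -0.65, 'loc': 0.46, 'scale': 1.24}}
values = cdf_from_best(best, [0, 0.5, 1])
print(values)
```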
The rest - extracting the arguments from the dataframe and writing the results to the dataframe - is just pandas, and it sounds like you have that under control.
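In case it helps, here is a rough sketch of that pandas step. Everything specific in it is an assumption for illustration: add_best_fit_cdfs is a hypothetical helper, the toy DataFrame stands in for econ.csv, and best_fits is a hand-written dictionary mapping each column name to a get_best-style result. In your workflow you would fill best_fits by running Fitter on each column and storing what f.get_best(method='sumsquare_error') returns.

```python
import numpy as np
import pandas as pd
from scipy import stats


def add_best_fit_cdfs(df, best_fits):
    """Return a copy of df with a CDF_BestFit_<column> column per fitted column."""
    out = df.copy()
    for col, best in best_fits.items():
        # Each entry is assumed to be a single-key dict: {name: {param: value}}.
        dist_name, params = next(iter(best.items()))
        dist = getattr(stats, dist_name)(**params)  # frozen distribution
        out[f"CDF_BestFit_{col}"] = dist.cdf(out[col].to_numpy())
    return out


# Toy stand-in for the real data; the numbers are illustrative only.
df = pd.DataFrame({"Value_1": [0.9, 0.7, 0.1], "Value_2": [1.5, -0.2, 0.4]})

# Hypothetical per-column fit results, shaped like Fitter.get_best() output.
best_fits = {
    "Value_1": {"norminvgauss": {"a": 1.87, "b": -0.65, "loc": 0.46, "scale": 1.24}},
    "Value_2": {"norm": {"loc": 0.5, "scale": 1.0}},
}

result = add_best_fit_cdfs(df, best_fits)
print(result)
```

Working on a copy keeps the original dataframe untouched, so you can rerun the fits without stacking up extra columns.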
Upvotes: 0