Mikal Fischer

Reputation: 21

Python Applying a CDF after Fitter

I would like to apply the best-fit CDF found by Fitter to each value in a number of pandas DataFrame columns, ideally by passing the Fitter results to scipy.stats (or another library).

I can get the distribution function easily enough from Fitter with the following code:

import numpy as np
import pandas as pd
import seaborn as sns
from fitter import Fitter, get_distributions

dataset = pd.read_csv("econ.csv")
dataset.head()

sns.set_style('white')
sns.set_context("paper", font_scale=2)

sns.displot(data=dataset, x="Value_1", kind="hist", bins=100, aspect=1.5)

spac = dataset['Value_1'].values
f = Fitter(spac, distributions=get_distributions())

f.fit()
f.summary()

f.get_best(method='sumsquare_error')

This provides me with an output for Value_1:

{'norminvgauss': {'a': 1.87, 'b': -0.65, 'loc': 0.46, 'scale': 1.24}}

Now this is where I am stuck:

Is there a way to pass this information back to Scipy Stats (or another library) so I can calculate the cumulative distribution function (CDF) of the best fit for each value in each column?

The dataset has columns Value_1 through Value_99 and about 400 rows. Once I know how to feed the Fitter results back into scipy.stats, I should be able to write a simple for loop to apply this to each column.

An example of the result would look like this:

ID     Value_1   CDF_BestFit_Value_1
n      0.9       0.33
n+1    0.7       0.07

Thanks in advance to anyone who is able to help with this.

Upvotes: 1

Views: 223

Answers (1)

Matt Haberland

Reputation: 3908

Is there a way to pass this information back to Scipy Stats (or another library) so I can calculate the cumulative distribution function (CDF) of the best fit for each value in each column?

IIUC, the information you are trying to pass back to scipy.stats looks like this:

{'norminvgauss': {'a': 1.87, 'b': -0.65, 'loc': 0.46, 'scale': 1.24}}

In this example, you would want to evaluate the CDF of the norminvgauss distribution with these parameters. The difficulty is that you want to do this programmatically, given the name of the distribution family as a string and the parameter name/value pairs as a dictionary.

One way to extract the name of the distribution family and the parameter info is like:

dist_name, params = list(d.items())[0]

(Perhaps there are more elegant ways, but this is the first that comes to mind. It's unusual to have just a single key in the dictionary, but I often iterate over dictionaries with a loop like for dist_name, params in d.items().)
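
As one possibly tidier alternative to indexing into a list, next(iter(...)) pulls out the first key/value pair without building the list first. This is only a sketch, and it assumes d is a single-entry dictionary in the shape that get_best returned in the question:

```python
# d has the same shape as the get_best() output shown in the question.
d = {'norminvgauss': {'a': 1.87, 'b': -0.65, 'loc': 0.46, 'scale': 1.24}}

# Take the first (and only) item without materialising a list.
dist_name, params = next(iter(d.items()))
```

If the dictionary could ever hold more than one entry, a plain for loop over d.items() is the safer choice.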

Then, the way to get the distribution family object is:

dist_family = getattr(stats, dist_name)
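
Because getattr will happily return anything that lives in the scipy.stats namespace, it may be worth checking that the name really refers to a continuous distribution. The helper below is a minimal sketch; get_family is a hypothetical name of my own, not something scipy or fitter provides:

```python
from scipy import stats

def get_family(dist_name):
    """Return the scipy.stats continuous distribution called dist_name.

    get_family is a hypothetical helper, not part of scipy or fitter.
    """
    family = getattr(stats, dist_name, None)
    # Reject names that are missing or that refer to something other than
    # a continuous distribution (e.g. stats.describe is a plain function).
    if not isinstance(family, stats.rv_continuous):
        raise ValueError(f"{dist_name!r} is not a continuous scipy.stats distribution")
    return family

dist_family = get_family('norminvgauss')
```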

I would pass the parameters to the distribution family object to create a "frozen" distribution with those parameters:

dist = dist_family(**params)

You can then evaluate its CDF at whatever arguments you please:

dist.cdf([0, 0.5, 1])
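
Freezing is a convenience rather than a requirement: the same parameters can also be passed straight to the family's cdf method as keyword arguments. The sketch below compares the two routes using the parameter values from the question; the variable names are mine:

```python
import numpy as np
from scipy import stats

params = {'a': 1.87, 'b': -0.65, 'loc': 0.46, 'scale': 1.24}
x = [0, 0.5, 1]

# Frozen distribution: parameters are bound once and reused.
frozen_cdf = stats.norminvgauss(**params).cdf(x)

# Unfrozen call: the same parameters are passed along with the data.
direct_cdf = stats.norminvgauss.cdf(x, **params)

# Both routes should agree to within floating-point tolerance.
print(np.allclose(frozen_cdf, direct_cdf))
```

The frozen form tends to read better when the same fitted distribution is evaluated many times, for example once per column.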

For easier copy-pasting:

from scipy import stats
d = {'norminvgauss': {'a': 1.87, 'b': -0.65, 'loc': 0.46, 'scale': 1.24}}

dist_name, params = list(d.items())[0]
dist_family = getattr(stats, dist_name)
dist = dist_family(**params)
dist.cdf([0, 0.5, 1])
# array([0.45497638, 0.69433145, 0.87312848])

The rest - extracting the arguments from the dataframe and writing the results to the dataframe - is just pandas, and it sounds like you have that under control.
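
In case it helps, here is a rough sketch of that pandas step. It assumes you have already collected one get_best-style dictionary per column; add_best_fit_cdf, best_fits, and the toy data are all made up for illustration and are not part of fitter or scipy:

```python
import pandas as pd
from scipy import stats

def add_best_fit_cdf(df, best_fits):
    """Append a CDF_BestFit_<column> column for every entry in best_fits.

    best_fits maps a column name to a get_best()-style dictionary,
    e.g. {'Value_1': {'norm': {'loc': 0.5, 'scale': 1.0}}}.
    add_best_fit_cdf is a hypothetical helper name.
    """
    out = df.copy()
    for column, best in best_fits.items():
        dist_name, params = next(iter(best.items()))
        dist = getattr(stats, dist_name)(**params)  # frozen distribution
        out[f"CDF_BestFit_{column}"] = dist.cdf(out[column].to_numpy())
    return out

# Made-up data and fit results, for illustration only.
df = pd.DataFrame({'Value_1': [0.5, 1.5], 'Value_2': [0.0, 2.0]})
best_fits = {
    'Value_1': {'norm': {'loc': 0.5, 'scale': 1.0}},
    'Value_2': {'expon': {'loc': 0.0, 'scale': 2.0}},
}
result = add_best_fit_cdf(df, best_fits)
```

With the real data, best_fits would presumably be filled by running Fitter and get_best on each column in a loop; that part is left out here because it depends on your dataset.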

Upvotes: 0
