CAPSLOCK

Reputation: 6483

scipy.stats.probplot to generate qqplot using a custom distribution

I am trying to get scipy.stats.probplot to draw a Q-Q plot against a custom distribution. I have several numeric variables (all numpy arrays) and I want to check how their distributions differ using a Q-Q plot.

My dataframe df looks something like this:

         some_var  another_var
1        16.5704   3.3620
2        12.8373  -8.2204
3        8.1854    1.9617
4        13.5683   1.8376
5        8.5143    2.3173
6        6.0123   -7.7536
7        9.6775   -4.3874
...      ...       ...
189499   11.8561  -8.4887
189500   10.0422  -4.6228

According to the reference:

dist : str or stats.distributions instance, optional

Distribution or distribution function name. The default is ‘norm’ for a normal probability plot. Objects that look enough like a stats.distributions instance (i.e. they have a ppf method) are also accepted.

Of course a numpy array doesn't have the ppf method, so when I try the following:

import pylab
import scipy.stats as stats
stats.probplot(df[df.columns[1]].values, dist=df[df.columns[2]].values, plot=pylab)

I get the following error:

AttributeError: 'numpy.ndarray' object has no attribute 'ppf'

(N.B. if I do not use the .values attribute, I get the same error, but for a 'Series' object instead of 'numpy.ndarray'.)

So, the question is: what is an object with a ppf method and how do I create it from my numpy array?

Upvotes: 0

Views: 3726

Answers (1)

Paul H

Reputation: 68186

The "dist" argument should be one of scipy's statistical distribution objects, such as stats.beta. That is what is meant by:

dist : str or stats.distributions instance, optional

So a self-contained example would be:

import numpy
from matplotlib import pyplot
from scipy import stats

random_beta = numpy.random.beta(0.3, 2, size=37)

fig, ax = pyplot.subplots(figsize=(6, 3))

_ = stats.probplot(
    random_beta,       # data
    sparams=(0.3, 2),  # shape parameters passed on to the distribution
    dist=stats.beta,   # the "dist" object
    plot=ax            # where the data should be plotted
)

That draws the sorted sample values against the theoretical beta quantiles, together with a least-squares fit line.

If you want to plot multiple columns of a data frame, you'll need to call probplot multiple times, plotting on the same (or new) axes each time.
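As a rough sketch of that loop, assuming a data frame shaped like the one in the question: the random data and the fits dictionary below are made up for illustration, and each column gets its own axes and its own probplot call.

```python
import numpy
import pandas
from matplotlib import pyplot
from scipy import stats

# Hypothetical stand-in for the data frame in the question.
rng = numpy.random.default_rng(0)
df = pandas.DataFrame({
    "some_var": rng.normal(10, 3, size=200),
    "another_var": rng.normal(-2, 5, size=200),
})

# One axes per column; squeeze=False keeps `axes` 2-D even for a single column.
fig, axes = pyplot.subplots(
    ncols=len(df.columns), figsize=(4 * len(df.columns), 3), squeeze=False
)

fits = {}
for ax, column in zip(axes[0], df.columns):
    # With the default fit=True, probplot returns
    # ((theoretical quantiles, ordered values), (slope, intercept, r)).
    _, fits[column] = stats.probplot(
        df[column].to_numpy(), dist=stats.norm, plot=ax
    )
    ax.set_title(column)

fig.tight_layout()
```

To overlay the columns instead, pass the same ax to every call; the markers will share one color unless you restyle them afterwards.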

In this simple case, the probscale package doesn't add much over scipy. It may be more flexible, though, if you later want probability scales instead of quantile scales:

import probscale

fig, ax = pyplot.subplots(figsize=(6, 3))
fig = probscale.probplot(
    random_beta,
    ax=ax,
    plottype='qq',
    bestfit=True,
    dist=stats.beta(0.3, 2)
)

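To come back to the literal question: one way to get "an object with a ppf method" from a numpy array is a small wrapper whose ppf returns the sample's empirical quantiles. This is only a sketch. EmpiricalDist is a hypothetical helper, not part of scipy, and the two arrays are random stand-ins for your columns. It relies on numpy.quantile to interpolate between the sorted values.

```python
import numpy
from matplotlib import pyplot
from scipy import stats


class EmpiricalDist:
    """Hypothetical helper: wraps a sample so that it exposes a ppf method."""

    def __init__(self, sample):
        self.sample = numpy.sort(numpy.asarray(sample, dtype=float))

    def ppf(self, q):
        # Empirical quantile function: linear interpolation between
        # the order statistics of the wrapped sample.
        return numpy.quantile(self.sample, q)


# Random stand-ins for the two columns being compared.
rng = numpy.random.default_rng(0)
some_var = rng.normal(10, 3, size=500)
another_var = rng.normal(-2, 5, size=500)

fig, ax = pyplot.subplots(figsize=(6, 3))
(osm, osr), (slope, intercept, r) = stats.probplot(
    some_var,                         # data on the vertical axis
    dist=EmpiricalDist(another_var),  # reference sample supplies the quantiles
    plot=ax,
)
```

The points then compare the quantiles of one sample with those of the other, so a roughly straight line suggests the two samples differ only in location and scale.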

Upvotes: 2
