Rick

Reputation: 79

How can I generate samples from a non-normal multivariable distribution in Python?

I have an input dataframe df_input with 10 variables and 100 rows. These data are not normally distributed. I would like to generate an output dataframe with 10 variables and 10,000 rows, such that the covariance matrix and mean of the new dataframe match those of the original one. The output variables should not be normally distributed, but should instead have a distribution similar to that of the input variables. That is: Cov(df_output) = Cov(df_input) and mean(df_output) = mean(df_input). Is there a Python function that does this?

Note: np.random.multivariate_normal(mean_input, Cov_input, 10000) does almost this, but the output variables are normally distributed, whereas I need them to have the same (or a similar) distribution as the input.

Upvotes: 1

Views: 2146

Answers (4)

Rick

Reputation: 79

The best method is indeed to use copulas, as many have suggested. The link below gives a simple description and also provides simple Python code. The method preserves covariances while augmenting the data, and it generalizes to non-symmetric or non-normal distributions. Thanks to all for helping.

https://sdv.dev/Copulas/tutorials/03_Multivariate_Distributions.html
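
For reference, a minimal sketch along the lines of that tutorial (assuming the copulas package is installed, e.g. pip install copulas, and that df_input is the dataframe from the question):

from copulas.multivariate import GaussianMultivariate

# fit the marginal distributions and the correlation structure of the input
model = GaussianMultivariate()
model.fit(df_input)

# sample new rows whose marginals, mean, and covariance resemble the input
df_output = model.sample(10000)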

Upvotes: 0

Pierre D

Reputation: 26251

Update

I just noticed your mention of np.random.multivariate_normal... It does in one fell swoop the equivalent of gen_like() below! (A one-line sketch follows the list.)

I'll leave it here to help people understand the mechanics of this, but to summarize:

  1. you can match the mean and covariance of an empirical distribution with a (rotated, scaled, translated) normal;
  2. for a better match of higher moments, you should look at the copula.
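
Concretely, a one-line sketch of point 1, reusing the d0/d1 names from the answer below:

import numpy as np

# a single call matches the empirical mean and covariance of d0;
# the marginals of d1 come out normal no matter what d0 looks like
d1 = np.random.multivariate_normal(np.mean(d0, 0), np.cov(d0.T), size=10000)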

Original answer

Since you are interested in matching only the first two moments (mean, variance), you can use a simple PCA to obtain a suitable model of the initial data. Note that the newly generated data will be a normal ellipsoid, rotated, scaled, and translated to match the empirical mean and covariance of the initial data.

If you want a more sophisticated "replication" of the original distribution, then you should look at copulas, as I said in the comments.

So, for the first two moments only, assuming your input data is d0:

import numpy as np
from sklearn.decomposition import PCA

def gen_like(d0, n):
    # fit a PCA rotation so the components are centered and uncorrelated
    pca = PCA(n_components=d0.shape[1]).fit(d0)
    z0 = pca.transform(d0)  # z0 is centered and uncorrelated (cov is diagonal)
    # draw independent normals with the same per-component spread as z0
    z1 = np.random.normal(size=(n, d0.shape[1])) * np.std(z0, 0)

    # project back to input space (rotate, scale, translate)
    d1 = pca.inverse_transform(z1)
    return d1

Example:

# generate some random data

# arbitrary transformation matrix
F = np.array([
    [1, 2, 3],
    [2, 1, 4],
    [5, 1, 3],
])
d0 = np.random.normal(2, 4, size=(10000, 3)) @ F.T

np.mean(d0, 0)
# ex: array([12.12791066, 14.10333273, 17.95212292])

np.cov(d0.T)
# ex: array([[225.09691912, 257.39878551, 259.40288019],
#            [257.39878551, 338.34087242, 373.4773562 ],
#            [259.40288019, 373.4773562 , 566.29288861]])
# try to match mean, variance of d0
d1 = gen_like(d0, 10000)

np.allclose(np.mean(d0, 0), np.mean(d1, 0), rtol=0.1)
# often True (but not guaranteed)

np.allclose(np.cov(d0.T), np.cov(d1.T), rtol=0.1)
# often True (but not guaranteed)

What's funny is that you can fit a square peg in a round hole (i.e., demonstrating that really only the mean and variance are matched, not the higher moments):

d0 = np.random.uniform(5, 10, size=(1000, 3)) @ F.T
d1 = gen_like(d0, 10000)

np.allclose(np.mean(d0, 0), np.mean(d1, 0), rtol=0.1)
# often True (but not guaranteed)

np.allclose(np.cov(d0.T), np.cov(d1.T), rtol=0.1)
# often True (but not guaranteed)
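
To make the missing higher moments visible, you can compare the marginal excess kurtosis (a quick check, assuming scipy is available):

from scipy import stats

stats.kurtosis(d0)  # noticeably negative: the uniform-based mixture is platykurtic
stats.kurtosis(d1)  # close to zero: the generated marginals are normal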

Upvotes: 1

Mafu

Reputation: 94

Have you considered using a GAN (generative adversarial network)? It takes a bit more effort than calling a predefined function, but essentially it does exactly what you are hoping to do. Here's the original paper: https://arxiv.org/abs/1406.2661

There are many PyTorch/TensorFlow implementations that you can download and adapt to your purposes, for example this one: https://github.com/eriklindernoren/PyTorch-GAN

Here is also a blog post I found quite helpful as an introduction to GANs: https://medium.com/ai-society/gans-from-scratch-1-a-deep-introduction-with-code-in-pytorch-and-tensorflow-cb03cdcdba0f
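
To give a feel for the mechanics, here is a minimal, hypothetical PyTorch sketch; the helper names (make_mlp, train_gan) and all hyperparameters are illustrative, not taken from the linked repo:

import torch
import torch.nn as nn

def make_mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

def train_gan(real, noise_dim=16, steps=2000, batch=64, lr=1e-3):
    n, d = real.shape
    G = make_mlp(noise_dim, d)                       # generator: noise -> row
    D = nn.Sequential(make_mlp(d, 1), nn.Sigmoid())  # discriminator: row -> P(real)
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    bce = nn.BCELoss()
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)
    for _ in range(steps):
        x_real = real[torch.randint(0, n, (batch,))]
        x_fake = G(torch.randn(batch, noise_dim))
        # discriminator step: push real rows toward 1, fake rows toward 0
        loss_d = bce(D(x_real), ones) + bce(D(x_fake.detach()), zeros)
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        # generator step: try to get fake rows classified as real
        loss_g = bce(D(x_fake), ones)
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return G

# usage sketch: fit on the 100x10 input, then draw 10,000 rows
# G = train_gan(torch.tensor(df_input.values, dtype=torch.float32))
# samples = G(torch.randn(10000, 16)).detach().numpy()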

Maybe a GAN is overkill for this problem and there are simpler methods for upscaling the sample size, in which case I'd be interested to learn about them.

Upvotes: 0

John

Reputation: 1021

Have you tried looking at the NumPy docs? https://numpy.org/doc/stable/reference/random/generated/numpy.random.multivariate_normal.html
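
For example, a minimal sketch of the documented call using the newer Generator API (assuming df_input from the question; note that this matches the mean and covariance but produces normal marginals, which is the limitation the question points out):

import numpy as np
import pandas as pd

rng = np.random.default_rng()
samples = rng.multivariate_normal(df_input.mean(), df_input.cov(), size=10000)
df_output = pd.DataFrame(samples, columns=df_input.columns)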

Upvotes: 0
