Reputation: 79
I have an input dataframe df_input with 10 variables and 100 rows. These data are not normally distributed. I would like to generate an output dataframe with 10 variables and 10,000 rows, such that the covariance matrix and mean of the new dataframe are the same as those of the original one. The output variables should not be normally distributed, but rather have a distribution similar to the input variables. That is: Cov(df_output) = Cov(df_input) and mean(df_output) = mean(df_input). Is there a Python function that does this?
Note: np.random.multivariate_normal(mean_input, Cov_input, 10000) does almost this, but the output variables are normally distributed, whereas I need them to have the same (or a similar) distribution as the input.
Upvotes: 1
Views: 2146
Reputation: 79
The best method is indeed to use copulas, as many suggested. A simple description, along with sample Python code, is in the link below. The method preserves covariances while augmenting the data, and it generalizes to non-symmetric and non-normal distributions. Thanks to all for helping.
https://sdv.dev/Copulas/tutorials/03_Multivariate_Distributions.html
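For intuition, the Gaussian-copula idea behind that library can be sketched by hand (a minimal illustration using only NumPy and the standard library, not the Copulas library's actual code): rank-transform each column to normal scores, sample a multivariate normal with the same correlation matrix, then map the scores back through each column's empirical quantiles. This keeps each marginal distribution close to the input while approximately preserving the correlation structure.

```python
import numpy as np
from statistics import NormalDist

_nd = NormalDist()
_ppf = np.vectorize(_nd.inv_cdf)  # inverse normal CDF, elementwise
_cdf = np.vectorize(_nd.cdf)      # normal CDF, elementwise

def copula_resample(d0, n, rng=None):
    """Gaussian-copula resampling: keep each column's empirical
    distribution while approximately preserving the correlation
    structure of d0 (shape: (n_samples, n_features))."""
    rng = np.random.default_rng(rng)
    d0 = np.asarray(d0, dtype=float)
    m, k = d0.shape
    # 1) map each column to normal scores through its empirical ranks
    ranks = d0.argsort(axis=0).argsort(axis=0) + 1
    z = _ppf(ranks / (m + 1))
    # 2) sample a multivariate normal with the same correlation matrix
    corr = np.corrcoef(z.T)
    z_new = rng.multivariate_normal(np.zeros(k), corr, size=n)
    # 3) map the scores back through each column's empirical quantiles
    u_new = _cdf(z_new)
    d1 = np.column_stack(
        [np.quantile(d0[:, j], u_new[:, j]) for j in range(k)]
    )
    return d1

# demo on skewed, correlated input
rng = np.random.default_rng(0)
d0 = rng.exponential(2.0, size=(400, 3))
d0[:, 1] += 0.7 * d0[:, 0]  # induce correlation between columns
d1 = copula_resample(d0, 5000, rng=1)
```

Note this is only an approximation: the output values interpolate between the observed order statistics, so they never exceed the input's range, and the covariance (as opposed to the rank correlation) is matched only approximately.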
Upvotes: 0
Reputation: 26251
Update
I just noticed your mention of np.random.multivariate_normal... it does in one fell swoop the equivalent of gen_like() below!
I'll leave the original answer here to help people understand the mechanics, but in short: drawing from np.random.multivariate_normal with the empirical mean and covariance of the input achieves the same result.
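As a quick sketch (the exponential d0 here is just a stand-in for whatever input array you have):

```python
import numpy as np

np.random.seed(0)
# stand-in for the original data: skewed and correlated via a mixing matrix
F = np.array([[1, 2], [2, 1]])
d0 = np.random.exponential(2.0, size=(1000, 2)) @ F.T

# draw 10,000 new samples matching the empirical mean and covariance of d0
d1 = np.random.multivariate_normal(np.mean(d0, axis=0), np.cov(d0.T), size=10000)

np.allclose(np.mean(d0, 0), np.mean(d1, 0), rtol=0.1)
# often True (but not guaranteed)
np.allclose(np.cov(d0.T), np.cov(d1.T), rtol=0.1)
# often True (but not guaranteed)
```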
Original answer
Since you are interested in matching only the first two moments (mean and covariance), you can use a simple PCA to obtain a suitable model of the initial data. Note that the newly generated data will form a normal ellipsoid, rotated, scaled, and translated to match the empirical mean and covariance of the initial data.
If you want a more sophisticated "replication" of the original distribution, then you should look at copulas, as I said in the comments.
So, for the first two moments only, assuming your input data is d0:
import numpy as np
from sklearn.decomposition import PCA

def gen_like(d0, n):
    # fit a full-rank PCA so the transform is invertible
    pca = PCA(n_components=d0.shape[1]).fit(d0)
    z0 = pca.transform(d0)  # z0 is centered and uncorrelated (cov is diagonal)
    # draw normal samples with the same per-component std as z0
    z1 = np.random.normal(size=(n, d0.shape[1])) * np.std(z0, 0)
    # project back to the input space
    d1 = pca.inverse_transform(z1)
    return d1
Example:
# generate some random data
# arbitrary transformation matrix
F = np.array([
[1, 2, 3],
[2, 1, 4],
[5, 1, 3],
])
d0 = np.random.normal(2, 4, size=(10000, 3)) @ F.T
np.mean(d0, 0)
# ex: array([12.12791066, 14.10333273, 17.95212292])
np.cov(d0.T)
# ex: array([[225.09691912, 257.39878551, 259.40288019],
# [257.39878551, 338.34087242, 373.4773562 ],
# [259.40288019, 373.4773562 , 566.29288861]])
# try to match mean, variance of d0
d1 = gen_like(d0, 10000)
np.allclose(np.mean(d0, 0), np.mean(d1, 0), rtol=0.1)
# often True (but not guaranteed)
np.allclose(np.cov(d0.T), np.cov(d1.T), rtol=0.1)
# often True (but not guaranteed)
What's funny is that you can fit a square peg in a round hole, i.e., demonstrate that really only the mean and covariance are matched, not the higher moments:
d0 = np.random.uniform(5, 10, size=(1000, 3)) @ F.T
d1 = gen_like(d0, 10000)
np.allclose(np.mean(d0, 0), np.mean(d1, 0), rtol=0.1)
# often True (but not guaranteed)
np.allclose(np.cov(d0.T), np.cov(d1.T), rtol=0.1)
# often True (but not guaranteed)
Upvotes: 1
Reputation: 94
Have you considered using a GAN (generative adversarial network)? It takes a bit more effort than just calling a predefined function, but essentially it does exactly what you are hoping to do. Here's the original paper: https://arxiv.org/abs/1406.2661
There are many PyTorch/Tensorflow codes that you can download and fit to your purposes, for example this one: https://github.com/eriklindernoren/PyTorch-GAN
Here is also a blog post I found quite helpful as an introduction to GANs: https://medium.com/ai-society/gans-from-scratch-1-a-deep-introduction-with-code-in-pytorch-and-tensorflow-cb03cdcdba0f
Maybe a GAN is overkill for this problem and there are simpler methods for upscaling the sample size, in which case I'd be interested to learn about them.
Upvotes: 0
Reputation: 1021
Have you tried looking at the NumPy docs?
https://numpy.org/doc/stable/reference/random/generated/numpy.random.multivariate_normal.html
Upvotes: 0