Reputation: 461
I want to create a data set with a specific mean and standard deviation.
np.random.normal() only gives me an approximation. For what I want to test, however, I need an exact mean and standard deviation.
I have tried a combination of norm.pdf and np.linspace, but the generated data set doesn't match either (though I may just be misusing them).
It doesn't matter whether the data set is random, as long as I can set a specific sample size, mean and standard deviation.
Help would be much appreciated.
Upvotes: 12
Views: 18710
Reputation: 695
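You can also generate the samples with the standard library's random module. Like np.random.normal(), this only gives an approximate mean and standard deviation.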
import random as rand
mean = 20.9
stdd = 3
samples = 1000
data = [rand.normalvariate(mean, stdd) for i in range(samples)]
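I also needed to generate data with residuals, so I simply added the product of a random factor between -1 and 1 and the residual. I use rand.uniform(-1, 1) for the factor here, because rand.randrange(-1, 1) only ever returns -1 or 0.
residual = 0.5  # example magnitude only; pick whatever suits your data
data = [rand.normalvariate(mean, stdd) + rand.uniform(-1, 1) * residual for _ in range(samples)]
Note that adding residuals will throw the mean and stdd off slightly.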
Upvotes: 0
Reputation: 613
For others seeing this later, Python 3.8+ has the statistics.NormalDist class for exactly this purpose:
import statistics as s
n = s.NormalDist(mu=10, sigma=2)
samples = n.samples(100_000, seed=42) # remove seed if desired
print(s.mean(samples)) # 10.004521585462394
print(s.stdev(samples)) # 2.0052615406360457
Methods from @Spoonless's answer can be used to tweak the exact mean and stdev of the samples if needed, or one can just use a large enough number of samples to get exceedingly close -- this is statistics, after all.
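A minimal sketch of that tweak using only the standard library. It assumes the sample standard deviation (statistics.stdev, with the n - 1 denominator) is the one that has to match, and the helper name exact_normal_samples is made up for illustration:

```python
import statistics as s


def exact_normal_samples(mu, sigma, n, seed=None):
    """Draw n normal samples, then shift and rescale them so that
    statistics.mean and statistics.stdev match mu and sigma.
    Requires n >= 2, since stdev is undefined for a single value."""
    raw = s.NormalDist(mu=0.0, sigma=1.0).samples(n, seed=seed)
    raw_mean = s.fmean(raw)
    raw_std = s.stdev(raw, xbar=raw_mean)  # sample std dev (n - 1)
    # Standardize the draws, then map them onto the target mean and spread.
    return [mu + sigma * (x - raw_mean) / raw_std for x in raw]


samples = exact_normal_samples(10, 2, 1000, seed=42)
print(s.mean(samples))   # equals 10 up to floating-point rounding
print(s.stdev(samples))  # equals 2 up to floating-point rounding
```

If the population standard deviation is the one that matters, swap s.stdev for s.pstdev in the helper.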
Upvotes: 6
Reputation: 571
The easiest approach is to generate some zero-mean samples with the desired standard deviation. Then subtract the sample mean from the samples so that their mean is exactly zero, scale them so that the standard deviation is spot on, and finally add the desired mean.
Here is some example code:
import numpy as np
num_samples = 1000
desired_mean = 50.0
desired_std_dev = 10.0
samples = np.random.normal(loc=0.0, scale=desired_std_dev, size=num_samples)
actual_mean = np.mean(samples)
actual_std = np.std(samples)
print("Initial samples stats : mean = {:.4f} stdv = {:.4f}".format(actual_mean, actual_std))
zero_mean_samples = samples - (actual_mean)
zero_mean_mean = np.mean(zero_mean_samples)
zero_mean_std = np.std(zero_mean_samples)
print("True zero samples stats : mean = {:.4f} stdv = {:.4f}".format(zero_mean_mean, zero_mean_std))
scaled_samples = zero_mean_samples * (desired_std_dev/zero_mean_std)
scaled_mean = np.mean(scaled_samples)
scaled_std = np.std(scaled_samples)
print("Scaled samples stats : mean = {:.4f} stdv = {:.4f}".format(scaled_mean, scaled_std))
final_samples = scaled_samples + desired_mean
final_mean = np.mean(final_samples)
final_std = np.std(final_samples)
print("Final samples stats : mean = {:.4f} stdv = {:.4f}".format(final_mean, final_std))
Which produces output similar to this:
Initial samples stats : mean = 0.2946 stdv = 10.1609
True zero samples stats : mean = 0.0000 stdv = 10.1609
Scaled samples stats : mean = 0.0000 stdv = 10.0000
Final samples stats : mean = 50.0000 stdv = 10.0000
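The same steps can be folded into one small function. This is only a sketch: the name exact_normal and the ddof and seed parameters are my own additions for illustration. Note that np.std defaults to the population standard deviation (ddof=0); pass ddof=1 if the sample standard deviation is the one that must be exact.

```python
import numpy as np


def exact_normal(num_samples, mean, std_dev, ddof=0, seed=None):
    """Return normal draws whose mean and standard deviation match the
    requested values up to floating-point rounding."""
    rng = np.random.default_rng(seed)
    samples = rng.standard_normal(num_samples)
    samples -= samples.mean()                    # mean becomes exactly zero
    samples *= std_dev / samples.std(ddof=ddof)  # std dev becomes exact
    return samples + mean                        # shift to the desired mean


data = exact_normal(1000, 50.0, 10.0, seed=0)
print("mean = {:.4f} stdv = {:.4f}".format(data.mean(), data.std()))
```

Because only a shift and a rescale are applied, the shape of the distribution is unchanged; the values are no longer independent draws, though, since each one depends on the sample mean and standard deviation.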
Upvotes: 14