Reputation: 37
Ok so I need to create some random data for simulation purposes. I know the mean values and standard deviations of some real-life scenarios that I am trying to reproduce. The issue I am having is that the random numbers generated do not correspond realistically with the dates. For example, the weather (MinTp) fluctuates wildly from day to day, which is not realistic. I want the numbers to be generated in a pattern so that the mean appears in the middle of the data set. Please see my code below, along with the output table and a scatterplot of MinTp over the year. I have been using np.random.normal() to generate the data; maybe I need to use a different function?
import numpy as np
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(2)

start2018 = datetime.datetime(2018, 1, 1)
end2018 = datetime.datetime(2018, 12, 31)
dates2018 = pd.date_range(start2018, end2018, freq='d')

# draw every variable independently from a normal distribution
synEne2018 = np.random.normal(loc=66.883795, scale=5.448145, size=365)
synMintp2018 = np.random.normal(loc=7.203288, scale=4.690315, size=365)
synCovidDailyCases2018 = np.random.normal(loc=0.0, scale=0.0, size=365)
synCovidDailyDeaths2018 = np.random.normal(loc=0.0, scale=0.0, size=365)

syn2018data = pd.DataFrame({'Date': dates2018, 'Total Daily Energy': synEne2018, 'MinTp': synMintp2018, 'DailyCovidCases': synCovidDailyCases2018, 'DailyCovidDeaths': synCovidDailyDeaths2018})
print(syn2018data)

fig, ax = plt.subplots()
sns.scatterplot(x="Date", y='MinTp', data=syn2018data, color='r')
Upvotes: 1
Views: 415
Reputation: 2167
A normal distribution has two parameters: the mean, which is the "loc" argument, and the standard deviation (std), which is the "scale" argument. The scale is not the total spread of your resulting values; the values follow the normal law, which basically means that 68% of your data will be within one std of the mean, 95% within two std, and 99.7% within three std. You can use a normal distribution table to get a grasp of the values you can expect. Keep in mind that such a table gives the probability p of getting a value below mean + z * std, not the probability of getting a value between mean - z * std and mean + z * std; you have to compute 2 * p - 1 to get the latter.
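For example, here is a rough sketch of how to check the 68-95-99.7 rule and the 2 * p - 1 conversion numerically (it assumes scipy is available for the table values):
import numpy as np
from scipy.stats import norm
mean, std = 7.203288, 4.690315
samples = np.random.normal(loc=mean, scale=std, size=100_000)
for z in (1, 2, 3):
    # share of samples within z standard deviations of the mean
    empirical = np.mean(np.abs(samples - mean) < z * std)
    # p = P(X < mean + z*std) from the cumulative table; 2*p - 1 covers the symmetric interval
    table = 2 * norm.cdf(z) - 1
    print(f"z={z}: empirical={empirical:.3f}, table={table:.3f}")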
If you lower the scale, you will get values closer to your mean.
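To illustrate (a small sketch with two arbitrary scales):
import numpy as np
wide = np.random.normal(loc=7.2, scale=4.69, size=365)
narrow = np.random.normal(loc=7.2, scale=0.2, size=365)
print(wide.min(), wide.max())      # roughly -7 to 21: large day-to-day swings
print(narrow.min(), narrow.max())  # stays close to 7.2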
To be more realistic, I would advise building a base curve for MinTp (with the minimum value in winter and the maximum value around August), then adding randomness on top of it using a normal distribution with loc=0 and scale=0.2 or so.
Using a sine curve from zero to pi as the base can do the trick if you plug your mean and your range into the sin function:
import math
# numpy, pandas, datetime, matplotlib and seaborn imported as in the question

start2018 = datetime.datetime(2018, 1, 1)
end2018 = datetime.datetime(2018, 12, 31)
dates2018 = pd.date_range(start2018, end2018, freq='d')

t_range = 4.690315  # our range
t_mean = 7.203288   # our mean

# base curve: half a sine period over the 365 days, scaled by the range and shifted by the mean
synMintp2018 = np.sin(np.arange(365)/365 * math.pi) * t_range + t_mean
# add a little day-to-day noise on top of the base curve
synMintp2018 += np.random.normal(loc=0, scale=0.2, size=365)
...
syn2018data = pd.DataFrame({'Date': dates2018, 'Total Daily Energy': synEne2018, 'MinTp': synMintp2018, 'DailyCovidCases': synCovidDailyCases2018, 'DailyCovidDeaths': synCovidDailyDeaths2018})
fig, ax = plt.subplots()
sns.scatterplot(x="Date", y='MinTp', data=syn2018data, color='r')
Since the minimum temperature is more likely to fall in mid-January than exactly on January 1st, we can add an offset to translate the base curve:
import math

# same base curve, but rolled forward so the minimum lands in mid-January
synMintp2018 = np.sin(np.arange(365)/365 * math.pi) * 4.690315 + 7.203288
synMintp2018 = np.roll(synMintp2018, 15)  # offset in days
synMintp2018 += np.random.normal(loc=0, scale=0.2, size=365)
...
syn2018data = pd.DataFrame({'Date': dates2018, 'Total Daily Energy': synEne2018, 'MinTp': synMintp2018, 'DailyCovidCases': synCovidDailyCases2018, 'DailyCovidDeaths': synCovidDailyDeaths2018})
fig, ax = plt.subplots()
sns.scatterplot(x=syn2018data.index, y=syn2018data['MinTp'], color='r')
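As a quick sanity check (a sketch reusing the synMintp2018 array from the snippet above), you can compare the generated series against the statistics you started from:
print("mean:", synMintp2018.mean())  # the non-negative sine base pushes this above the original 7.203288
print("std: ", synMintp2018.std())
print("min: ", synMintp2018.min(), "max:", synMintp2018.max())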
Upvotes: 1