axel_ande

Reputation: 449

Generating lognormal samples that fit the data they were generated from

I'm trying to create a new sample based on some other samples, but I'm doing or understanding something wrong. I have 34 samples which I assume are approximately lognormally distributed. Based on these samples I want to generate 2000 new samples. Here is the code that I'm running:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

samples = [480, 900, 1140, 1260, 1260, 1440, 1800, 1860, 1980, 2220, 2640, 2700,
           2880, 3420, 3480, 3600, 3840, 4020, 4200, 4320, 4380, 4920, 5160,
           5280, 6900, 7680, 9000, 10320, 10500, 10800, 15000, 21600, 25200,
           39000]
# empirical survival function: fraction of samples exceeding each value
plt.plot(samples, 1 - np.linspace(0, 1, len(samples)))
# fit a 3-parameter lognormal and draw 2000 new samples from it
std, loc, scale = stats.lognorm.fit(samples)
new_samples = stats.lognorm(std, loc=loc, scale=scale).rvs(size=2000)

# histogram of the generated samples, normalized to fractions
a = plt.hist(new_samples, bins=range(100, 40000, 200),
             weights=np.ones(len(new_samples)) / len(new_samples))
plt.show()

Here is the plot, and as you can see there are very few generated samples above 1000, although the original data contains quite a few values above 1000.

[plots of the generated samples]

How do I best generate a sample that better represents the expected values?

Upvotes: 2

Views: 100

Answers (1)

JohanC

Reputation: 80299

Something seems to be going wrong here with stats.lognorm.fit.

The docs mention an alternative: fit stats.norm to the log of the samples and then use exp(mu) as the scale. That seems to work a lot better.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

samples = [480, 900, 1140, 1260, 1260, 1440, 1800, 1860, 1980, 2220, 2640, 2700,
           2880, 3420, 3480, 3600, 3840, 4020, 4200, 4320, 4380, 4920, 5160,
           5280, 6900, 7680, 9000, 10320, 10500, 10800, 15000, 21600, 25200,
           39000]
samples = np.array(samples)

# free 3-parameter fit: loc ends up just below the smallest sample (480)
std, loc, scale = stats.lognorm.fit(samples) # 2.865850745357322, 479.99969879223596, 1.1400622824414484
weird_samples = stats.lognorm(std, loc=loc, scale=scale).rvs(size=2000)

# alternative from the docs: fit a normal to log(samples) and use exp(mu) as scale
mu, std = stats.norm.fit(np.log(samples)) # 8.304837454505837, 0.9720253999925554
scale = np.exp(mu) # 4043.3848507251523
loc = 0
new_samples = stats.lognorm(std, loc=loc, scale=scale).rvs(size=2000)

plt.plot(samples, 1 - np.linspace(0, 1, len(samples)), label='given samples')
plt.plot(np.sort(weird_samples), 1 - np.linspace(0, 1, len(weird_samples)), label='using stats.lognorm.fit(samples)')
plt.plot(np.sort(new_samples), 1 - np.linspace(0, 1, len(new_samples)), label='using stats.norm.fit(log(samples))')
plt.legend()
plt.show()

[resulting plot: survival curves for the given samples and both fits]

Seaborn's kdeplot shows the following:

import seaborn as sns

# note: seaborn >= 0.11 deprecates the bw= parameter in favor of bw_method=/bw_adjust=
bw = 1500
sns.kdeplot(samples, bw=bw, label='given samples')
sns.kdeplot(weird_samples, bw=bw, label='using stats.lognorm.fit(samples)')
sns.kdeplot(new_samples, bw=bw, label='using stats.norm.fit(log(samples))')
plt.xlim(-5000, 45000)
plt.show()

[kdeplots of the given samples and both fitted distributions]

PS: The problem seems to be that fitting 3 parameters with only 34 samples doesn't work very well: the free fit can push loc right up against the smallest sample, which gives a high likelihood but a useless generative model. You can force lognorm.fit to use loc=0 via floc, which finds much more sensible parameters. The loc parameter just shifts the distribution by that amount; often loc=0 is the better choice.

std, loc, scale = stats.lognorm.fit(samples, floc=0) # 0.9720253999925554, 0.0, 4043.3848507251523
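
As a quick sanity check (a minimal sketch, not part of the original answer; the variable names free_fit and fixed_fit are just for illustration), scipy's stats.kstest can compare the data against each fitted distribution. A small p-value suggests the fitted distribution does not match the data:

# Minimal sketch: compare both fits with a Kolmogorov-Smirnov test.
# lognorm.fit returns (shape, loc, scale), which kstest accepts via args=.
free_fit = stats.lognorm.fit(samples)           # free 3-parameter fit
fixed_fit = stats.lognorm.fit(samples, floc=0)  # loc fixed at 0
print(stats.kstest(samples, 'lognorm', args=free_fit))
print(stats.kstest(samples, 'lognorm', args=fixed_fit))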

Instead of forcing loc with floc, you can also provide an initial guess. This looks even better:

std, loc, scale = stats.lognorm.fit(samples, loc=0) # 1.0527481074345748, 203.08004314932137, 3712.4903893865644

[plot: survival curves when fitting with the initial guess loc=0]
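
Another quick way to judge a fit (again a small sketch, not from the original answer) is to compare a few empirical quantiles of the data with the quantiles of the fitted distribution via its ppf; std, loc and scale here are the values from the fit with the initial guess above:

# Minimal sketch: empirical quantiles vs. quantiles of the fitted distribution
qs = [0.25, 0.5, 0.75, 0.9]
print(np.quantile(samples, qs))                          # from the data
print(stats.lognorm(std, loc=loc, scale=scale).ppf(qs))  # from the fit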

Upvotes: 2
