Reputation: 15
If I generate an array
custom=np.ones(800, dtype=np.float32)
and then create a custom probability distribution out of it using
custom=normalize(custom)[0]
customPDF = stats.rv_discrete(name='pdfX', values=(np.arange(800), custom))
Then if I use
customPDF.rvs()
I get values returned in the range 0 - 20, whereas I expect random numbers varying from 0 to 800.
The following code gives me the required output:
random.uniform(0,800)
But because I need to be able to manipulate the probability distribution by altering the custom array, I have to use customPDF.rvs().
Is there a solution to this, or an explanation of why this is happening?
In [206]: custom=np.ones(800, dtype=np.float32)
In [207]: custom=normalize(custom)[0]
/usr/local/lib/python3.4/dist-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
DeprecationWarning)
In [208]: customPDF = stats.rv_discrete(name='pdfX', values=(np.arange(800), custom))
In [209]: customPDF.rvs()
Out[209]: 7
In [210]: customPDF.rvs()
Out[210]: 13
In [211]: customPDF.rvs()
Out[211]: 15
In [212]: customPDF.rvs()
Out[212]: 3
In [213]: customPDF.rvs()
Out[213]: 8
In [214]: customPDF.rvs()
Out[214]: 10
In [215]: customPDF.rvs()
Out[215]: 10
In [216]: customPDF.rvs()
Out[216]: 11
In [217]: customPDF.rvs()
Out[217]: 15
In [218]: customPDF.rvs()
Out[218]: 6
In [219]: customPDF.rvs()
Out[219]: 7
In [220]: random.uniform(0,800)
Out[220]: 707.0265562968543
Upvotes: 1
Views: 542
Reputation: 74154
The problem is this line:
custom=normalize(custom)[0]
Based on the warning, it looks like normalize refers to sklearn.preprocessing.normalize. normalize expects an [n_samples, n_features] 2D array; since you give it a 1D vector, it will insert a new dimension and treat it as a [1, n_features] array (which is why you are indexing the 0th element of the output).
By default it will adjust the L2 (Euclidean) norm of each row of features to be equal to 1. This is not the same as making the elements sum to 1:
print(normalize(np.ones(800))[0].sum())
# 28.2843
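To make the difference concrete, here is a minimal NumPy-only sketch that reproduces both kinds of normalization by hand. It assumes the default behaviour described above (dividing each row by its Euclidean norm); the names l2, l1, l2_sum and l1_sum are only for illustration.

```python
import numpy as np

custom = np.ones(800, dtype=np.float32)

# L2 normalization: divide by the Euclidean norm, sqrt(sum(x**2)).
l2 = custom / np.linalg.norm(custom)

# L1 normalization: divide by the sum of absolute values.
# This is what a probability vector needs.
l1 = custom / np.abs(custom).sum()

l2_sum = float(l2.sum())  # close to sqrt(800), i.e. about 28.28
l1_sum = float(l1.sum())  # close to 1.0
print(l2_sum, l1_sum)
```

Only the L1-normalized vector sums to 1, so only that one is a valid set of probabilities.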
Since the sum of custom is much greater than 1, the cumulative probability of drawing a particular integer reaches 1 before you get to the end of the probability vector:
print(custom.cumsum().searchsorted(1))
# 28
The consequence is that you will never draw an integer larger than 28:
print(customPDF.rvs(size=100000).max())
# 28
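One way to see why the draws stop at 28 is to sample by inverting the cumulative sum yourself. The sketch below is an illustration of inverse-CDF sampling in plain NumPy, not a claim about how rv_discrete is implemented internally; the names rng, cdf, u and draws are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# The L2-normalized vector from the question: every entry is 1/sqrt(800).
custom = np.ones(800, dtype=np.float64) / np.sqrt(800)

# Running total of the "probabilities"; it passes 1.0 at index 28.
cdf = custom.cumsum()

# Uniform numbers in [0, 1) are mapped to the first index whose
# cumulative value is at least as large.
u = rng.random(100_000)
draws = cdf.searchsorted(u)

max_draw = int(draws.max())
print(max_draw)
```

Because u is always below 1, no draw can land past the index where the cumulative sum first reaches 1, so the remaining entries of the vector are unreachable.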
What you ought to do in order to normalize custom is divide by its sum:
custom /= custom.sum()
# or alternatively:
custom = np.repeat(1./800, 800)
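Putting it together, a corrected version of the workflow from the question might look like the sketch below. It assumes SciPy is installed; the reweighting of the first 100 entries is a hypothetical example of "manipulating the distribution", and the fixed random_state is only there to make the run repeatable.

```python
import numpy as np
from scipy import stats

custom = np.ones(800, dtype=np.float64)

# Hypothetical tweak: make the first 100 integers five times more likely.
custom[:100] *= 5.0

# Normalize so the entries sum to 1 (L1 normalization).
custom /= custom.sum()

customPDF = stats.rv_discrete(name='pdfX', values=(np.arange(800), custom))

# Draw many samples at once instead of calling rvs() repeatedly.
samples = customPDF.rvs(size=10000, random_state=0)
print(samples.min(), samples.max())
```

With the weights summing to 1, the draws cover the whole range of integers from 0 to 799 instead of stopping near 28.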
Upvotes: 4