Alexander

Reputation: 15

In SciPy, why does customPDF.rvs() with uniform probabilities return values only from the beginning of the range?

If I generate an array

custom=np.ones(800, dtype=np.float32)

and then create a custom probability distribution out of it using

custom=normalize(custom)[0]
customPDF = stats.rv_discrete(name='pdfX', values=(np.arange(800), custom))

and then call

customPDF.rvs()

I get values in the range 0 to 20, whereas I expect random numbers spread over the whole range from 0 to 800.

The following code gives me the required output:

random.uniform(0,800) 

But because I need to be able to manipulate the probability distribution by altering the custom array, I have to use customPDF.rvs().

Is there a solution to this, or an explanation of why it is happening?


In [206]: custom=np.ones(800, dtype=np.float32)

In [207]: custom=normalize(custom)[0]
/usr/local/lib/python3.4/dist-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)

In [208]: customPDF = stats.rv_discrete(name='pdfX', values=(np.arange(800), custom))

In [209]: customPDF.rvs()
Out[209]: 7

In [210]: customPDF.rvs()
Out[210]: 13

In [211]: customPDF.rvs()
Out[211]: 15

In [212]: customPDF.rvs()
Out[212]: 3

In [213]: customPDF.rvs()
Out[213]: 8

In [214]: customPDF.rvs()
Out[214]: 10

In [215]: customPDF.rvs()
Out[215]: 10

In [216]: customPDF.rvs()
Out[216]: 11

In [217]: customPDF.rvs()
Out[217]: 15

In [218]: customPDF.rvs()
Out[218]: 6

In [219]: customPDF.rvs()
Out[219]: 7

In [220]: random.uniform(0,800)
Out[220]: 707.0265562968543

Upvotes: 1

Views: 542

Answers (1)

ali_m

Reputation: 74154

The problem is this line:

custom=normalize(custom)[0]

Based on the warning, it looks like normalize refers to sklearn.preprocessing.normalize. normalize expects a 2D array of shape [n_samples, n_features]. Since you give it a 1D vector, it inserts a new dimension and treats it as a [1, n_features] array (which is why you have to index the 0th element of the output).

By default it rescales each row of features so that its L2 (Euclidean) norm equals 1. This is not the same as making the elements sum to 1:

print(normalize(np.ones(800))[0].sum())
# 28.2843
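
To see the difference without sklearn, here is a minimal NumPy-only sketch comparing the two scalings. The names l2_scaled and l1_scaled are just illustrative, and the L2 line is assumed to mirror what normalize does by default for a single row:

```python
import numpy as np

ones = np.ones(800)

# L2 scaling: divide by the Euclidean norm, sqrt(800) here
l2_scaled = ones / np.linalg.norm(ones)

# L1 scaling: divide by the sum, which is what a probability vector needs
l1_scaled = ones / ones.sum()

print(l2_scaled.sum())  # about 28.2843, i.e. sqrt(800)
print(l1_scaled.sum())  # about 1.0
```

Only the second vector is a valid set of probabilities.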

Since the sum of custom is much greater than 1, the cumulative probability reaches 1 long before you get to the end of the probability vector:

print(custom.cumsum().searchsorted(1))
# 28

The consequence is that you will never draw an integer larger than 28:

print(customPDF.rvs(size=100000).max())
# 28
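
The cap can be reproduced with a simplified inverse-CDF sampler. This is only a sketch of the general idea (draw a uniform number, look it up in the cumulative sum), not SciPy's actual implementation; pk, cdf and draws are illustrative names:

```python
import numpy as np

rng = np.random.default_rng(0)

# The mis-normalized "probabilities": each entry is 1/sqrt(800)
pk = np.ones(800) / np.sqrt(800)
cdf = pk.cumsum()

# Inverse-CDF sampling: a uniform draw in [0, 1) maps to the first
# index whose cumulative probability is at least that large
u = rng.random(100000)
draws = cdf.searchsorted(u)

print(draws.max())
```

Because u never exceeds 1 and the cumulative sum already passes 1 at index 28, no larger index can ever be selected.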

What you ought to do in order to normalize custom is divide it by its sum:

custom /= custom.sum()

# or alternatively:
custom = np.repeat(1./800, 800)
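
Putting it together, a minimal end-to-end sketch might look like the following. The tripled weights on the first 100 entries are a made-up example of "altering the custom array", and float64 is used here as an assumption to keep the sum as close to 1 as possible:

```python
import numpy as np
from scipy import stats

# Unnormalized weights; edit these to reshape the distribution
weights = np.ones(800, dtype=np.float64)
weights[:100] *= 3  # hypothetical tweak: favour the first 100 integers

# Normalize so the entries sum to 1
probs = weights / weights.sum()

customPDF = stats.rv_discrete(name='pdfX', values=(np.arange(800), probs))

# random_state is fixed only to make the sketch reproducible
samples = customPDF.rvs(size=10000, random_state=0)
print(samples.min(), samples.max())
```

With correctly normalized probabilities, the samples cover the whole range 0 to 799 rather than stopping at 28.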

Upvotes: 4
