user1205476

Reputation: 401

How to sample in Tensorflow by custom probability distribution?

I have a vector e.g., V = [10, 30, 20, 50] of N elements and a probability vector P = [.2, .3, .1, .4]. In tensorflow, how can I randomly sample K elements from V that obey the given probability distribution P? I want the sampling to be done with replacement.

Upvotes: 2

Views: 5519

Answers (2)

Dr. Prasanna Date

Reputation: 775

tf.distributions.Categorical() might be the way to do it in a one-liner. According to this page, given a probability distribution P defined over N values, tf.distributions.Categorical() can generate integers 0, 1, ..., N-1 with probabilities P[0], P[1], ..., P[N-1]. The generated integers can be interpreted as indices into the vector V. The following code snippet illustrates this:

import tensorflow as tf

# Probability distribution
P = [0.2, 0.3, 0.1, 0.4]

# Vector of values
V = [10, 30, 20, 50]

# .eval() below needs a default session
sess = tf.InteractiveSession()

# Define categorical distribution
dist = tf.distributions.Categorical(probs=P)

# Generate a sample from the categorical distribution - this serves as an index
index = dist.sample().eval()

# Fetch the value at V[index] as the sample
sample = V[index]

All of this can be done in a one-liner:

sample = V[tf.distributions.Categorical(probs=P).sample().eval()]

If you want to generate K samples from this distribution, wrap the above one-liner in a list comprehension:

samples = [ V[tf.distributions.Categorical(probs=P).sample().eval()] for i in range(K) ]

Output of the above code for K = 30:

[50, 10, 30, 50, 30, 20, 50, 30, 50, 50, 30, 50, 30, 50, 20, 10, 50, 20, 30, 30, 50, 50, 50, 30, 20, 50, 30, 30, 50, 50]

There might be better ways than using a list comprehension, though, since each iteration above runs a separate session call; see the sketch below.
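One alternative, assuming the standard Distribution.sample() signature that accepts a sample shape, is to draw all K indices in a single op and gather the corresponding values (a minimal sketch):

import tensorflow as tf

# Probability distribution and values
P = [0.2, 0.3, 0.1, 0.4]
V = tf.constant([10, 30, 20, 50])
K = 30

dist = tf.distributions.Categorical(probs=P)
indices = dist.sample(K)         # draw all K indices in one op
samples = tf.gather(V, indices)  # map the indices back to values

with tf.Session() as sess:
    print(sess.run(samples))

This keeps everything inside the graph and avoids one session call per sample.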

Upvotes: 2

Peter Szoldan

Reputation: 4868

tf.nn.fixed_unigram_candidate_sampler does more or less what you want. The trouble is, it can only take int32 arguments as the unigrams parameter (the probability distribution), because it was designed for high-cardinality multiclass processing, such as language processing. You can multiply the numbers in the probability distribution to get integers, but only up to a limit of accuracy.
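For example, scaling by 100 turns P into the integer weights used below (a minimal sketch; the factor 100 is an arbitrary choice that caps the precision at two decimal places):

# Scale float probabilities to integer weights; precision is limited by the factor
P = [0.2, 0.3, 0.1, 0.4]
unigrams = [int(round(p * 100)) for p in P]  # [20, 30, 10, 40]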

Put the desired number of samples in num_sampled and the probability weights into unigrams (which has to be int32). The parameter true_classes has to be filled with the same number of elements as num_true, but is otherwise irrelevant here, because you get the indices back (and then use those to pull the samples). unique can be changed to True if desired.

This is tested code for you:

import tensorflow as tf
import numpy as np
sess = tf.Session()

V = tf.constant( np.array( [[ 10, 30, 20, 50 ]]), dtype=tf.int64)

sampled_ids, true_expected_count, sampled_expected_count = tf.nn.fixed_unigram_candidate_sampler(
   true_classes = V,
   num_true = 4,
   num_sampled = 50,
   unique = False,
   range_max = 4,
   unigrams = [ 20, 30, 10, 40 ] # this is P, times 100
)
sample = tf.gather( V[ 0 ], sampled_ids )
x = sess.run( sample )
print( x )

Output:

[50 20 10 30 30 30 10 30 20 50 50 50 10 50 10 30 50 50 30 30 50 10 20 30 50 50 50 50 30 50 50 30 50 50 50 50 50 50 50 10 50 30 50 10 50 50 10 30 50 50]
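To sanity-check the result, one can count how often each value occurs and compare the relative frequencies against P (a small sketch using numpy on the array x printed above; expect sampling noise with only 50 draws):

values, counts = np.unique(x, return_counts=True)
print(dict(zip(values, counts / counts.sum())))  # should roughly follow P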

If you really want to use float32 probability values, then you have to build the sampler from several parts (no single operation exists for this), like this (tested code):

import tensorflow as tf
import numpy as np
sess = tf.Session()

k = 50 # number of samples you want
V = tf.constant( [ 10, 30, 20, 50 ], dtype = tf.float32 ) # values
P = tf.constant( [ 0.2, 0.3, 0.1, 0.4 ], dtype = tf.float32 ) # prob dist

cum_dist = tf.cumsum( P ) # create cumulative probability distribution

# get random values between 0 and the max of cum_dist
# we'll determine where it is in the cumulative distribution
rand_unif = tf.random_uniform( shape=( k, ), minval = 0.0, maxval = tf.reduce_max( cum_dist ), dtype = tf.float32 )

# create boolean to signal where the random number is greater than the cum_dist
# take advantage of broadcasting to create Cartesian product
greater = tf.expand_dims( rand_unif, axis = -1 ) > tf.expand_dims( cum_dist, axis = 0 )

# we get the indices by counting how many are greater in any given row
idxs = tf.reduce_sum( tf.cast( greater, dtype = tf.int64 ), 1 )

# then just gather the sample from V by the indices
sample = tf.gather( V, idxs )

# run, output
print( sess.run( sample ) )

Output:

[20. 10. 50. 50. 20. 30. 10. 20. 30. 50. 20. 50. 30. 50. 30. 50. 50. 50. 50. 50. 50. 30. 20. 20. 20. 10. 50. 30. 30. 10. 50. 50. 50. 20. 30. 50. 30. 10. 50. 20. 30. 50. 30. 10. 10. 50. 50. 20. 50. 30.]
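Depending on your TensorFlow version, the same float-probability sampling can also be done in a single call with tf.multinomial, by passing log-probabilities as logits (a minimal sketch, assuming TF 1.x where tf.multinomial and tf.log are available):

import tensorflow as tf
sess = tf.Session()

k = 50
V = tf.constant([10, 30, 20, 50], dtype=tf.float32)
P = tf.constant([0.2, 0.3, 0.1, 0.4], dtype=tf.float32)

# tf.multinomial expects (unnormalized) log-probabilities with a batch dimension
idxs = tf.multinomial(tf.log(P[tf.newaxis, :]), k)[0]
sample = tf.gather(V, idxs)
print(sess.run(sample))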

Upvotes: 3
