TensorFlow Probability VI: inference with discrete + continuous RVs: gradient estimation?

Question

tensorflow==2.7.0
tensorflow-probability==0.14.1

TLDR

To perform VI on discrete RVs, should I use:

A- the REINFORCE gradient estimator
B- the Gumbel-Softmax reparametrization
C- another solution

and how should I implement it?

Problem statement

Sorry in advance for the long issue, but I believe the problem requires some explaining.

I want to implement a Hierarchical Bayesian Model involving both continuous and discrete Random Variables. A minimal example is a Gaussian Mixture model:

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
tfb = tfp.bijectors

G = 2

p = tfd.JointDistributionNamed(
    model=dict(
        mu=tfd.Sample(
            tfd.Normal(0., 1.),
            sample_shape=(G,)
        ),
        z=tfd.Categorical(
            probs=tf.ones((G,)) / G
        ),
        x=lambda mu, z: tfd.Normal(
            loc=mu[z],
            scale=1.
        )
    )
)

In this example I deliberately avoid the tfd.Mixture API so that the Categorical label z is exposed. I want to perform Variational Inference in this setting: for instance, given an observed x, fit a Categorical distribution with learnable probabilities to the posterior of z:

q_probs = tfp.util.TransformedVariable(
    tf.ones((G,)) / G,
    tfb.SoftmaxCentered(),
    name="q_probs"
)
q_loc = tf.Variable(0., name="q_loc")
q_scale = tfp.util.TransformedVariable(
    1.,
    tfb.Exp(),
    name="q_scale"
)

q = tfd.JointDistributionNamed(
    model=dict(
        mu=tfd.Normal(q_loc, q_scale),
        z=tfd.Categorical(probs=q_probs)
    )
)

The issue is that, when computing the ELBO and optimizing q_probs, I cannot use the reparameterization gradient estimator. AFAIK this is because z is a discrete RV:


def log_prob_fn(**kwargs):
    return p.log_prob(
        **kwargs,
        x=tf.constant([2.])
    )


optimizer = tf.optimizers.SGD()

@tf.function
def fit_vi():
    return tfp.vi.fit_surrogate_posterior(
        target_log_prob_fn=log_prob_fn,
        surrogate_posterior=q,
        optimizer=optimizer,
        num_steps=10,
        sample_size=8
    )

_ = fit_vi() 
# This last line raises:
# ValueError: Distribution `surrogate_posterior` must be reparameterized, i.e.,a diffeomorphic transformation
# of a parameterless distribution. (Otherwise this function has a biased gradient.)

I'm looking for a way to make this work. I've identified at least 2 ways to circumvent the issue: using the REINFORCE gradient estimator or the Gumbel-Softmax reparameterization.

A- REINFORCE gradient

Cf. this TFP API link. A classical result in VI is that the REINFORCE gradient can deal with a non-differentiable objective function, for instance one that is non-differentiable because of discrete RVs.

Could I use the tfp.vi.GradientEstimators.SCORE_FUNCTION estimator instead of tfp.vi.GradientEstimators.REPARAMETERIZATION, via the lower-level tfp.vi.monte_carlo_variational_loss function? With the REINFORCE gradient, I only need the log_prob method of q to be differentiable; the sample method need not be.

As far as I understood it, the sample method for a Categorical distribution implies a gradient break, but the log_prob method does not. Am I correct to assume that this could help with my issue? Am I missing something here?
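To make option A concrete, here is a minimal sketch of the score-function idea written with plain TensorFlow ops, not with TFP's own estimators. It relies on the identity grad E_q[f(z)] = E_q[f(z) * grad log q(z)], so only log q is differentiated and the sampled labels are treated as constants. Everything here is an assumption made for illustration: the component means `mu` are held fixed, only `q_logits` is learned, and the helper names `log_joint`, `score_function_grad` and `exact_elbo_grad` are hypothetical.

```python
import math
import tensorflow as tf

tf.random.set_seed(0)

G = 2
x_obs = tf.constant(2.0)
mu = tf.constant([0.0, 2.0])         # hypothetical: component means held fixed
q_logits = tf.Variable(tf.zeros(G))  # variational parameters of q(z)


def log_joint(z):
    """log p(z) + log p(x_obs | mu[z]) for a batch of integer labels z."""
    log_prior = -math.log(G)  # uniform Categorical prior
    loc = tf.gather(mu, z)
    log_lik = -0.5 * (x_obs - loc) ** 2 - 0.5 * math.log(2.0 * math.pi)
    return log_prior + log_lik


def score_function_grad(num_samples):
    """Monte Carlo REINFORCE estimate of d ELBO / d q_logits."""
    with tf.GradientTape() as tape:
        log_q_all = tf.nn.log_softmax(q_logits)
        # Sampling is not differentiated: z is just a batch of integers.
        z = tf.random.categorical(q_logits[tf.newaxis, :], num_samples)[0]
        log_q = tf.gather(log_q_all, z)
        # The learning signal is treated as a constant.
        signal = tf.stop_gradient(log_joint(z) - log_q)
        surrogate = tf.reduce_mean(signal * log_q)
    return tape.gradient(surrogate, q_logits)


def exact_elbo_grad():
    """Exact gradient by summing over all G labels (only feasible for small G)."""
    with tf.GradientTape() as tape:
        log_q_all = tf.nn.log_softmax(q_logits)
        q = tf.exp(log_q_all)
        elbo = tf.reduce_sum(q * (log_joint(tf.range(G)) - log_q_all))
    return tape.gradient(elbo, q_logits)


estimate = score_function_grad(20000)
exact = exact_elbo_grad()
```

With a large sample size, `estimate` should approach `exact`; with the small sample sizes typical of VI (such as 8), the estimate is much noisier, which is the variance concern raised in this section.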

I also wonder: why is this possibility not exposed in the tfp.vi.fit_surrogate_posterior API? Is the performance bad, i.e. is the variance of the estimator too large for practical purposes?

B- Gumbel-Softmax reparameterization

Cf. this TFP API link. I could also reparameterize z as a variable y = tfd.RelaxedOneHotCategorical(...). The issue is that I need a proper categorical label to use in the definition of x, so AFAIK I need to do the following:

p_GS = tfd.JointDistributionNamed(
    model=dict(
        mu=tfd.Sample(
            tfd.Normal(0., 1.),
            sample_shape=(G,)
        ),
        y=tfd.RelaxedOneHotCategorical(
            temperature=1.,
            probs=tf.ones((G,)) / G
        ),
        x=lambda mu, y: tfd.Normal(
            loc=mu[tf.argmax(y)],
            scale=1.
        )
    )
)

...but this would just move the gradient-breaking problem to tf.argmax. This is where I may be missing something. Following the Gumbel-Softmax paper (Jang et al., 2016), I could then use the "straight-through" (ST) strategy and "plug" the gradients of the variable tf.one_hot(tf.argmax(y)) (the "discrete y") onto y (the "continuous y").

But again I wonder: how do I do this properly? I don't want to mix and match the gradients by hand, and I guess an autodiff backend is precisely meant to spare me this. How could I create a distribution that separates the forward direction (sampling a "discrete y") from the backward direction (gradient computed using the "continuous y")? I guess this is the intended usage of the tfd.RelaxedOneHotCategorical distribution, but I don't see this implemented anywhere in the API.

Should I implement this myself? How? Could I use something along the lines of tf.custom_gradient?
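For reference, a common way to get straight-through behaviour without tf.custom_gradient is the tf.stop_gradient trick sketched below. This is only an illustration under stated assumptions, not something taken from the TFP API: the relaxed sample is drawn by hand with Gumbel noise as a stand-in for tfd.RelaxedOneHotCategorical, the helpers `sample_relaxed` and `straight_through` are hypothetical names, the loss is a toy negative log-likelihood (up to a constant), and the label lookup mu[tf.argmax(y)] is replaced by a dot product with the one-hot vector so that gradients can flow.

```python
import tensorflow as tf

tf.random.set_seed(0)

G = 2
temperature = 0.5
x_obs = 2.0
q_logits = tf.Variable(tf.zeros(G))  # variational parameters of q(y)
mu = tf.Variable([0.0, 2.0])         # component means


def sample_relaxed(logits, temperature, num_samples):
    """Gumbel-Softmax samples: softmax((logits + Gumbel noise) / temperature)."""
    u = tf.random.uniform(
        (num_samples, logits.shape[-1]), minval=1e-6, maxval=1.0 - 1e-6)
    gumbel = -tf.math.log(-tf.math.log(u))
    return tf.nn.softmax((logits + gumbel) / temperature, axis=-1)


def straight_through(y_soft):
    """Forward value: one-hot of argmax. Backward gradient: that of y_soft."""
    y_hard = tf.one_hot(
        tf.argmax(y_soft, axis=-1), depth=y_soft.shape[-1], dtype=y_soft.dtype)
    # The stop_gradient term contributes the hard value but no gradient.
    return y_soft + tf.stop_gradient(y_hard - y_soft)


with tf.GradientTape() as tape:
    y_soft = sample_relaxed(q_logits, temperature, num_samples=8)
    y_st = straight_through(y_soft)
    # One-hot dot product selects mu[k] while staying differentiable in y.
    loc = tf.reduce_sum(y_st * mu, axis=-1)
    loss = tf.reduce_mean(0.5 * (x_obs - loc) ** 2)

grads = tape.gradient(loss, [q_logits, mu])
```

In the forward pass `y_st` is numerically one-hot, so `loc` is exactly one of the component means; in the backward pass the gradient with respect to `q_logits` is the one of the relaxed sample. The resulting gradient is biased, and the bias depends on `temperature`.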

Actual question

Which solution (A, B, or another) is meant to be used with the TFP API, if any? How should I implement that solution efficiently?