Sampling from multivariate customised cumulative distribution function in Matlab

Question

Consider a 5-variate cumulative distribution function (cdf) which I call F.

I want to sample random 5x1 vectors from this cdf in Matlab. F is not a cdf that has been already implemented in Matlab (such as normal, t-student, etc.). Specifically, it is defined as

I have read several question/answers in this and other forums on how to sample from customised probability distribution functions in Matlab. However,

1) Most of them are for univariate cdf, e.g. here. The idea is to apply inverse transform sampling. My problem is a bit more complicated because I would need to "invert"a 5-variate function.

2) Another option could be to use slicesample as suggested here but I don't know how to write the analytical expression of the probability density function in my case.

3) Here's another idea but specific for the bivariate case.

Could you kindly help me to understand how I can proceed?

Cris Luengo · Accepted Answer

Your link under #3 gives a hint of the solution. It explains the bivariate case when you have a PDF. Here we'll extend this to any number of dimensions, for the case when you have a CDF.

So the process is:

Compute the marginal CDF for r₁.
Take a random sample using this marginal CDF (another link you posted explains how to do this).
Compute the marginal CDF for r₂ given r₁.
Take a random sample using this marginal CDF
Compute the marginal CDF for r₃ given r₁ and r₂.
etc. etc. You see where this is going.

Note that if you have a PDF, computing marginal distributions involves integrating over the remaining variables. So a marginal distribution for r₁ requires integrating over r₂..r₅, a marginal distribution for r₂ given r₁ requires integrating over r₃..r₅, etc.

When you have a CDF, computing the marginal distributions is trivial, as it already integrates the PDF: the marginal distribution for r₁ is F(x,∞,∞,∞,∞). However, obtaining the marginal distribution given one or more variables requires differentiating: a marginal distribution for r₂ given r₁ requires differentiating along r₁, a marginal distribution for r₃ given r₁ and r₂ requires differentiating along r₁ and r₂, etc.

It might be possible to obtain these derivatives analytically (this would be the more efficient solution). Here we use the finite difference derivative approximation instead (this makes it easier to plug in any CDF).

Let's see some MATLAB code:

sigma_a = 0.5;
sigma_b = 0.3;
F = @(r1,r2,r3,r4,r5)exp(-exp(-r1) - (exp(-r2/sigma_a)+exp(-r3/sigma_a)).^sigma_a ...
                                   - (exp(-r4/sigma_b)+exp(-r5/sigma_b)).^sigma_b);

lims = [-5,10]; % This is the area along all dimensions containing 99.99% of the PDF

N = 1000;
values = zeros(N,5);
for n=1:N
   values(n,:) = sample_random(F,5,lims);
end

Here I picked some random values for sigma_a and sigma_b, and used those to define a function F of 5 variables r1..r5. I figured that the domain of the PDF is the same along all dimensions, I found a region slightly larger than really necessary (lims). Next, I obtain 1000 random samples from the distribution F by calling sample_random:

function r = sample_random(F,N,lims)
delta = diff(lims)/10000;
x = linspace(lims(1),lims(2),300);
r = inf(1,N);
for ii = 1:N
   marginal = get_marginal(F,r,ii,x,delta);
   p = rand * marginal(end);
   [~,I] = unique(marginal); % interp1 cannot handle duplicated points, let's remove them
   r(ii) = interp1(marginal(I),x(I),p);
end

delta is the distance we'll use for the finite difference approximation to the derivative. x represents sample points along any one dimension of F.

We first define r as the vector [inf,inf,inf,inf,inf], we'll use this as the sample locations, and at the end of the function it will contain the random value drawn from our distribution.

Next we loop over the 5 dimensions, in each iteration we sample the marginal distribution for dimension ii, given the values for the previous dimensions (which have already been picked). The function get_marginal is below. We pick a random value in between 0 and the max of this marginal CDF (note that the max decreases as we pick values of r for each dimension, when ii==1 the max is 1), and we use this random value to interpolate into the inverse sampled marginal CDF (inverse simply means swapping x and y). I needed to remove repeated values from marginal because it becomes the x in interp1, and this function requires the x values to be unique.

Finally, the function get_marginal:

function marginal = get_marginal(F,r,ii,x,delta)
N = length(r);
marginal = zeros(size(x));
for jj=0:2^(ii-1)-1
   rr = flip(dec2bin(jj,N)-'0');
   sign = mod(sum(rr,2),2);
   if sign == 0
      sign = 1;
   else
      sign = -1;
   end
   args = num2cell(r - delta * rr);
   args{ii} = x;
   marginal = marginal + sign * F(args{:});
end

This contains quite a bit of complication. It samples the CDF along a given dimension ii, at the points x, given the fixed values r(1:ii-1).

The complication comes from computing partial derivatives. If we were to compute the marginal distribution for any one dimension without having picked any fixed values yet, we would simply do e.g.

marginal = F(inf,x,inf,inf,inf);

Having picked one value, we would do

marginal = F(r1,x,inf,inf,inf) - F(r1-delta,x,inf,inf,inf);

(this is the approximation to the partial derivative along the first dimension).

The code in get_marginal does this for ii-1 fixed values. This requires sampling F twice for each of these fixed values, as well as for each combination of delta shifts, a total of n^2 times (for n fixed values). The dec2bin bit is to get all these combinations. sign determines whether to add or subtract the given sample from the running total. args is a cell array with the 5 arguments to the function F, elements 1:ii-1 are the fixed values, element ii is set to x, and elements ii+1:N are inf.

Finally, I draw the marginal distributions of the data set values (which contains the 1000 elements randomly drawn from the CDF), and overlay with the marginal distributions of the CDF, to verify all is correct:

lims = [-2,5];
x = linspace(lims(1),lims(2),300);
figure
for ii=1:5
   subplot(5,1,ii)
   histogram(values(:,ii),'normalization','cdf','BinLimits',lims)
   hold on
   args = num2cell(inf(1,5));
   args{ii} = x;
   plot(x,F(args{:}))
   text(5.2,0.5,['r_',num2str(ii)])
end

Sampling from multivariate customised cumulative distribution function in Matlab

Answers (1)

Related Questions