nbonneel

Reputation: 3326

inverse of a cdf

I would like to compute the inverse cumulative distribution function (inverse cdf) of a given pdf. The pdf is given directly as a histogram, i.e., a vector of N equally spaced components.

My current approach is:

cdf = cumsum(pdf);
K = 3;   %// some upsampling factor
maxVal = 1;   %// just for my own usage - a scaling factor
M = length(cdf);
N = M*K;   %// increase resolution for higher accuracy
y = zeros(N, 1);
cursor = 2;
for i=1:N
   desiredF = (i-1)/(N-1)*maxVal;
   while (cursor<M && cdf(cursor)<desiredF)
    cursor = cursor+1;
   end;    

   if (cdf(cursor)==cdf(cursor-1))
       y(i) = cursor-1;
   else        
       alpha = min(1, max(0,(desiredF - cdf(cursor-1))/(cdf(cursor)-cdf(cursor-1))));
       y(i) = ((cursor-1)*(1-alpha) + alpha*cursor )/maxVal;
   end;

end;

y = resample(y, 1, K, 0);

which means that I upsample with linear interpolation, invert, and downsample my histogram. This code is rather ugly, not very robust (if I change the upsampling factor, I can get very different results), and needlessly slow... could anyone suggest a better approach?

Note: the generalized inverse I am trying to compute (in case the cdf is not invertible) is:

F^{-1}(t) = \inf\{ x \in \mathbb{R} : F(x) > t \}

with F the cumulative distribution function.
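
For reference, this definition translates almost literally into MATLAB for a single level t (a minimal sketch, not the solution I am after; it assumes pdf sums to 1 and that bin k sits at abscissa (k-1)/(M-1)):

cdf = cumsum(pdf);
M = length(cdf);
x = (0:M-1)'/(M-1);               %// abscissa of each histogram bin
t = 0.7;                          %// some level in [0, cdf(end))
k = find(cdf > t, 1, 'first');    %// smallest index such that F(x) > t
Finv_t = x(k);                    %// generalized inverse F^{-1}(t)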

[EDIT: actually, K = 1 (i.e., no upsampling) seems to give more accurate results...]

Thanks!

Upvotes: 4

Views: 8950

Answers (2)

nbonneel

Reputation: 3326

OK, I think I found a much shorter version, which works at least as fast and as accurately:

cdf = cumsum(pdf);
M = length(cdf);
xx = linspace(0,1,M);
invcdf = interp1(cdf, xx, xx);   %// swap the roles of x and F(x)

[EDIT: No, this is actually still two to three times slower than the initial code... don't ask me why! And it does not handle non-strictly-monotonic functions: this produces the error "The values of X should be distinct"]
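
For the non-strictly-monotonic case, a possible workaround (just a sketch, I have not benchmarked it) is to drop the repeated cdf values with unique() before calling interp1:

cdf = cumsum(pdf);
M = length(cdf);
xx = linspace(0,1,M);
[cdfU, iu] = unique(cdf);             %// keep one sample point per distinct cdf value
invcdf = interp1(cdfU, xx(iu), xx);   %// queries outside [cdfU(1), cdfU(end)] return NaN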

Upvotes: 0

ely

Reputation: 77404

If your input is specified in the form of a non-normalized histogram, then simply using the built-in quantile() function automatically computes the data point for a specified quantile, which is what the inverse CDF does. If the histogram has been normalized by the number of data points (making it a probability vector), then just multiply it by the number of data points first. See here for the quantile() details. Basically, given your histogram/data, you treat the first parameter (the data) as fixed, which turns quantile() into a function only of the specified probability values p. You could easily write a wrapper function to make it more convenient if necessary. This removes the need to explicitly compute the CDF with cumsum().
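
For instance, the wrapper could simply be an anonymous function (an illustrative sketch with a made-up sample vector; data stands for whatever raw samples your histogram summarizes):

 data = [1 1 2 2 2 3 5 5 9];            %// hypothetical raw samples, for illustration only
 my_inv_cdf = @(p) quantile(data, p);   %// inverse CDF as a function of the probability p
 my_inv_cdf(0.5)                        %// e.g. the median of this small sample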

Added

If we assume the histogram, bins, and number of data points are h, b, and N, respectively, then:

 h1 = N*h; %// Only if histogram frequencies have been normalized.
 data = [];
 for kk = 1:length(h1)
     data = [data repmat(b(kk), 1, h1(kk))];
 end

 %// Set p to the probability you want the inv-cdf for...
 p = 0.5;
 inv_cdf = quantile(data,p)

Added

For solutions that must leverage the existing PDF vector, we can do the following. Assume that x_old and pdf_old are the histogram bins and histogram frequencies, respectively.

 p = 0.5; %// the inv-cdf probability that I want
 num_points_i_want = 100; %// the number of points I want in my histogram vector

 x_new = linspace(min(x_old),max(x_old),num_points_i_want);
 pdf_new = interp1(x_old,pdf_old,x_new);
 cdf_new = cumsum(pdf_new) / sum(pdf_new);   %// normalize so the CDF ends at 1
 inv_cdf = min(x_new(cdf_new >= p));

Alternatively, we could build the cumsum() CDF first and use interp1() on that, if interpolating the PDF onto a finer grid first is not desirable.
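
That might look roughly like this (just a sketch, reusing x_old, pdf_old, and p from above and normalizing the CDF so it ends at 1):

 cdf_old = cumsum(pdf_old) / sum(pdf_old);   %// CDF on the original bins
 [cdf_u, iu] = unique(cdf_old);              %// interp1 requires distinct sample points
 inv_cdf = interp1(cdf_u, x_old(iu), p);     %// interpolate x as a function of the CDF at p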

Upvotes: 4
