Reputation: 3326
I would like to compute the inverse cumulative distribution function (inverse CDF) of a given PDF. The PDF is given directly as a histogram, i.e., a vector of N equally spaced components.
My current approach is:
cdf = cumsum(pdf);
K = 3; %// some upsampling factor
maxVal = 1; %// just for my own usage - a scaling factor
M = length(cdf);
N = M*K; %// increase resolution for higher accuracy
y = zeros(N, 1);
cursor = 2;
for i = 1:N
    desiredF = (i-1)/(N-1)*maxVal;
    while (cursor < M && cdf(cursor) < desiredF)
        cursor = cursor + 1;
    end
    if (cdf(cursor) == cdf(cursor-1))
        y(i) = cursor - 1;
    else
        alpha = min(1, max(0, (desiredF - cdf(cursor-1))/(cdf(cursor) - cdf(cursor-1))));
        y(i) = ((cursor-1)*(1-alpha) + alpha*cursor)/maxVal;
    end
end
y = resample(y, 1, K, 0);
which means that I upsample with linear interpolation, invert, and downsample my histogram. This code is rather ugly, not very robust (changing the upsampling factor can give quite different results), and needlessly slow... could anyone suggest a better approach?
Note: the generalized inverse I am trying to compute (in case the CDF is not invertible) is:
F^{-1}(t) = \inf \{ x \in R : F(x) > t \}
with F the cumulative distribution function.
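Since this definition amounts to "the first x whose CDF exceeds t", it can also be evaluated directly with a vectorized search. A minimal sketch (the normalization step and the probability grid t are my own assumptions, not part of the code above):

cdf = cumsum(pdf(:));
cdf = cdf / cdf(end);               %// normalize so the CDF ends at 1
M = numel(cdf);
t = linspace(0, 1 - 1/M, M).';      %// probabilities to invert (kept below 1)
%// For each t(j), find the first bin index where cdf > t(j):
[~, idx] = max(bsxfun(@gt, cdf(:).', t), [], 2);
y = idx;                            %// y(j) is the generalized inverse at t(j)

Here max() on a logical matrix returns the index of the first true entry in each row, which is exactly the infimum over bin indices; bsxfun() keeps it compatible with older MATLAB releases that lack implicit expansion.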
[EDIT: actually, K = 1 (i.e., no upsampling) seems to give more accurate results...]
Thanks!
Upvotes: 4
Views: 8950
Reputation: 3326
OK, I think I found a much shorter version, which runs at least as fast and is at least as accurate:
cdf = cumsum(pdf);
M = length(cdf);
xx = linspace(0,1,M);
invcdf = interp1(cdf,xx,xx)
[EDIT: no, this is actually still two to three times slower than the initial code... don't ask me why! And it does not handle non-strictly-monotonic functions: it produces the error "The values of X should be distinct".]
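A possible workaround for that error, sketched under the assumption that repeated CDF values can simply be dropped before interpolating (the unique() step is my addition, not part of the post above):

cdf = cumsum(pdf);
cdf = cdf / cdf(end);            %// normalize so the CDF ends at 1
M = length(cdf);
xx = linspace(0, 1, M);
%// Drop repeated CDF values so interp1 sees strictly increasing abscissae:
[cdf_u, iu] = unique(cdf);
invcdf = interp1(cdf_u, xx(iu), xx);
%// Note: query points below cdf_u(1) come back as NaN; clamp them if needed.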
Upvotes: 0
Reputation: 77404
If your input is specified in the form of a non-normalized histogram, then the built-in quantile() function directly computes the data point for a specified quantile, which is what the inverse CDF does. If the histogram is normalized by the number of data points (making it a probability vector), then just multiply it by the number of data points first. See here for the quantile() details. Basically, you'll assume that given your histogram/data, the first parameter is fixed, which turns quantile() into a function only of the specified probability values p. You could easily write a wrapper function to make it more convenient if necessary. This removes the need to explicitly compute the CDF with cumsum().
Added
If we assume the histogram, bins, and number of data points are h, b, and N, respectively, then:
h1 = round(N*h); %// Only if histogram frequencies have been normalized; round() ensures integer counts.
data = [];
for kk = 1:length(h1)
    data = [data repmat(b(kk), 1, h1(kk))]; %// replicate each bin value by its count
end
%// Set p to the probability you want the inv-cdf for...
p = 0.5;
inv_cdf = quantile(data, p)
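The wrapper mentioned above could look like the following sketch; the function name and the use of repelem() (which needs R2015a or later) are my own choices, not from the answer:

function q = inv_cdf_from_hist(h, b, N, p)
%// Hypothetical wrapper: expand a normalized histogram into raw data once,
%// then delegate to quantile(). Assumes h sums to 1 (assumption).
    counts = round(N * h);          %// integer repetition counts
    data = repelem(b, counts);      %// vectorized version of the repmat loop
    q = quantile(data, p);
end

Using repelem() avoids growing the data array inside a loop, which is the slow part of the snippet above for large N.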
Added
For solutions that must leverage the existing PDF vector, we can do the following. Assume that x_old and pdf_old are the histogram bins and histogram frequencies, respectively.
p = 0.5; %// the inv-cdf probability that I want
num_points_i_want = 100; %// the number of points I want in my histogram vector
x_new = linspace(min(x_old), max(x_old), num_points_i_want);
pdf_new = interp1(x_old, pdf_old, x_new);
cdf_new = cumsum(pdf_new);
cdf_new = cdf_new / cdf_new(end); %// renormalize so the CDF ends at 1
inv_cdf = min(x_new(cdf_new >= p));
Alternatively, we could create the cumsum() CDF first and use interp1() on that if it's not desirable to interpolate first.
Upvotes: 4