Star

Reputation: 2299

Transforming draws in Matlab from Gaussian mixture to uniform

Consider the following draws in Matlab of a 2x1 random vector whose distribution is a mixture of two Gaussian components.

P = 10^3; % number of draws
v = 1;    % common variance of each Gaussian component

%First component
mu_a = [0,0.5];
sigma_a = [v,0;0,v];

%Second component
mu_b = [0,8.2];
sigma_b = [v,0;0,v];


%Combine    
MU = [mu_a;mu_b];
SIGMA = cat(3,sigma_a,sigma_b);
w = ones(1,2)/2; %equal weight 0.5
obj = gmdistribution(MU,SIGMA,w);

%Draws
RV_temp = random(obj,P); % P-by-2 matrix of draws

% Transform each component of RV_temp into a uniform in [0,1] by estimating the cdf.
RV1=ksdensity(RV_temp(:,1), RV_temp(:,1),'function', 'cdf');
RV2=ksdensity(RV_temp(:,2), RV_temp(:,2),'function', 'cdf'); 

Now, if we check whether RV1 and RV2 are uniformly distributed on [0,1] by doing

ecdf(RV1)
ecdf(RV2)

we can see that RV1 is approximately uniform on [0,1] (its empirical cdf is close to the 45 degree line), while RV2 is not.
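For a more quantitative version of the same check, one could run a Kolmogorov-Smirnov test against U(0,1). This is a sketch only; kstest and makedist are Statistics and Machine Learning Toolbox functions, like gmdistribution above, and the p-values are only approximate here because the cdf was estimated from the same sample:

% Sketch: KS test against the uniform distribution on [0,1]
% h = 0 means uniformity is not rejected at the 5% level
[h1,p1] = kstest(RV1,'CDF',makedist('Uniform'))
[h2,p2] = kstest(RV2,'CDF',makedist('Uniform'))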

I don't understand why. It seems that the farther apart mu_a(2) and mu_b(2) are, the worse the job ksdensity does for a given (reasonable) number of draws. Why?

Upvotes: 5

Views: 323

Answers (2)

Nicky Mattsson

Reputation: 3052

When you have a mixture of N(0.5,v) and N(8.2,v), the range of the generated data is larger than it would be with means that are closer together, such as the N(0,v) and N(0,v) mixture you have in the other dimension. You then ask ksdensity to approximate a function using P points spread over this range.

As with standard linear interpolation, the denser the points, the better the approximation of the function inside the range; the same holds here. So in the N(0.5,v) and N(8.2,v) mixture, where the points are spread more sparsely over the range, the approximation is worse than in the N(0,v) and N(0,v) mixture, where the points are denser.
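One way to see that the estimation step, not the transform itself, is the weak point: the cdf of the second coordinate is known in closed form in this example, and plugging the draws into it gives a (near-)uniform sample already at P = 10^3. A minimal sketch, reusing RV_temp and v from the question:

% Exact cdf of the second coordinate: equal-weight mixture of N(0.5,v) and N(8.2,v)
F2 = @(x) 0.5*normcdf(x,0.5,sqrt(v)) + 0.5*normcdf(x,8.2,sqrt(v));
U2 = F2(RV_temp(:,2)); % probability integral transform with the true cdf
ecdf(U2)               % close to the 45 degree line

Whatever gap remains when ksdensity is used instead is estimation error, and it shrinks as the P points fill the wider range more densely.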

As a small side note, is there any reason you do not apply ksdensity directly to the bivariate data? Also, I cannot reproduce your comment saying that 5e2 points also work well. A final comment: 1e3 is typically preferred over 10^3.

Upvotes: 2

Armand Chocron

Reputation: 149

I think this is simply about the number of samples you are using. For the first component, the means of the two Gaussians are relatively close, so a thousand samples are enough to obtain a cdf very close to the U[0,1] cdf. For the second component, the gap between the means is larger, so you need more samples. With 100000 samples, I obtained the following result:

Result with 100000 Samples

With 1000 I obtained this:

Result with 1000 Samples

This is clearly farther from the uniform cdf. Try increasing the number of samples to a million and check whether the result gets closer still.
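To make "getting closer" concrete, one could track the largest deviation of the empirical cdf from the 45 degree line as the number of draws grows. A rough sketch along those lines (it reuses obj from the question; the deviation measure is just an illustrative choice):

% Sketch: maximum deviation of ecdf(U) from the diagonal for growing sample sizes
for P = [1e3 1e4 1e5 1e6]
    R = random(obj,P);                                % draws from the same mixture
    U = ksdensity(R(:,2), R(:,2), 'function', 'cdf'); % estimated-cdf transform
    [f,x] = ecdf(U);
    fprintf('P = %g, max deviation = %.4f\n', P, max(abs(f - x)));
end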

Upvotes: 0
