Huggzorx
Huggzorx

Reputation: 166

Select n weighted elements by index from a very large array in MATLAB

Suppose I have a very large square matrix, M(i, j), such that each element in the matrix represents the probability that the element will be chosen in a weighted random selection. I need to sample n elements from the matrix (by the (i, j) indices) with replacement. The weights will change on every iteration of the main loop.

Currently, I am using something like the following:

for m = 1:M_size
    xMean(m) = mean(M(:, m));
end

[~, j_list] = histc(rand(n, 1), cumsum([0; xMean'./sum(xMean)']));
for c = 1:n
    [~, i_list(c)] = ...
      histc(rand(1, 1), cumsum([0;, M(:, j_list(c))./sum(M(:, j_list(c)))]));
end

But this seems to be a rather clunky method, that also takes a very long time due to the for loop. Is there a more efficient method? Perhaps if I vectorize the matrix in some way?

*Edit I should mention that I do not have access to a statistics toolbox

Many thanks in advance.

Upvotes: 0

Views: 423

Answers (3)

Pursuit
Pursuit

Reputation: 12345

I think I would actually solve this by un-vectorizing. That is, remove all of the high level calls and expensive operations and strip it down to the essentials, using only predefined arrays and simple operations.

The core of an algorithm would be:

  1. Determine the sum of the weights

  2. Select n random numbers between 0 and the sum of the weights, sort them.

  3. Manually implement a cumsum loop. However, instead of storing all cumulative sums, just store the indexes where the cumsum jumps from less than the current random number to more than the current random number.

In code (with a bit of a timing rig), that looks like this:

tic
for ixTiming = 1:1000

    M = abs(randn(50));
    M_size = size(M, 2);
    n = 8;
    total = sum(M(:));

    randIndexes = sort(rand(n,1) * total);

    list = zeros(n,1);
    ixM = 1;
    ixNextList = 1;
    curSum = 0;
    while ixNextList<=n  && ixM<numel(M)
        while curSum<randIndexes(ixNextList) && ixM<=numel(M)
            curSum = curSum+M(ixM);
            ixM = ixM + 1;
        end
        list(ixNextList) = ixM;
        ixNextList = ixNextList+1;
    end
    [i_list, j_list] = ind2sub(size(M),list);

end
toc; %0.216 sec. on my computer

Compare this to the code in the original question:

tic
for ixTiming = 1:1000
    M = abs(randn(50));
    M_size = size(M, 2);
    n = 8;

    for m = 1:M_size
        xMean(m) = mean(M(:, m));
    end

    [~, j_list] = histc(rand(n, 1), cumsum([0; xMean'./sum(xMean)']));
    for c = 1:n
        [~, i_list(c)] = ...
            histc(rand(1, 1), cumsum([0;, M(:, j_list(c))./sum(M(:, j_list(c)))]));
    end
end
toc;  %1.10 sec on my computer

Caveats and optimizations.

  • I have not extensively tested this. Random number operations are hard to for propery random behavior. Run a few test cases over a lot of monte carlo sets to make sure the behavior is as expected. Especially watch out for off-by-one type errors.

  • Profile, and then look for additional improvements in any slow steps. Some possibilities.

    • Maintain the total value as you change M, so you don;t need to recalculate.

    • Check the lowest and highest value of randIndexes against 0 and total. If randIndexes(1) is larger thantotal-randIndexes(end), then incrementixMfromnumel(M)to1, rather than from1tonumel(M)`.

Upvotes: 0

macduff
macduff

Reputation: 4685

% M is ixj
xMean = transpose(mean(M,1));
%xMean is jx1, so i hope n == j
[~, j_list] = histc(rand(n, 1), cumsum([0; xMean./sum(xMean)]));
% j_list is not used? but is j x 1
cumsumvals = cumsum([zeros(1,jj);, M(:,j_list(1:n))./kron(sum(M(:,j_list(1:n))),ones(ii,1))],1),1)
% cumsumvals is i+1 x j, so looks like it should work
% but histc won't work with a matrix valued edge parameter
% you'll need to look into hist3 for that
for c = 1:n
    [~, i_list(c)] = ...
      histc(rand(1, 1), cumsumvals(:,c));
end

So it's closer, but you'll need hist3 to make completely vectorized.

Upvotes: 0

AE426082
AE426082

Reputation: 633

randsample (docs) is your friend here. I would use the following method which converts to indexes then back to subscripts:

selected_indexes = randsample(1:numel(M), n, true, M(:));
[sub_i, sub_j] = ind2sub(size(M), selected_indexes);

You might have to do a few transposes on M to get the appropriate dimensions.

Upvotes: 1

Related Questions