stackname
stackname

Reputation: 111

Finding the index of a reordered vector in a reordered large matrix

Suppose I have a reordered matrix A, which is get by using A = nchoosek(1:100, 6) in MATLAB. Thus, A means all the combinations by selecting 6 digits from 1 to 100.

A is reordered, and the dim of A is very large (1e9*6), so A is like this:

 1     2     3     4     5     6
 1     2     3     4     5     7
 1     2     3     4     5     8
 .     .     .     .     .     .
 .     .     .     .     .     .
 94    96    97    98    99   100
 95    96    97    98    99   100

Then, I have another reordered vector B, which is a member of A. Such as B=[5 10 11 40 51 67];

So, how to find the index of B using the fastest way, which means utilizing the ordering information??

Upvotes: 1

Views: 64

Answers (1)

You should note how the rows of A are generated. I suspect this is not specified in the documentation, so you probably shouldn't rely on it too much. What I mean is that this is technically undocumented behaviour, and it could change with any version.

But note that each element loops over starting from the ending indices, and every combination is sorted. This means that you need to start from the beginning, in

B = [5 10 11 40 51 67];

the first index is 5, which means that all the elements starting with 4 have been enumerated. How many are those?

nchoosek(100-1,6-1)+nchoosek(100-2,6-1)+nchoosek(100-3,6-1)+nchoosek(100-4,6-1)
% [1 . . . . .]       [2 . . . . .]       [3 . . . . .]       [4 . . . . .]
%    ^ 2:100             ^ 3:100             ^ 4:100             ^ 5:100

since we have 99 numbers to choose 5 from if the first element is 1, 98 to choose 5 from if the first number is 2, etc.

Next come the elements [5 6 7 8 9 10] through [5 9 97 98 99 100]. Again, these are full combinations, if the second element is fixed:

nchoosek(100-6,6-2)+nchoosek(100-7,6-2)+nchoosek(100-8,6-2)+nchoosek(100-9,6-2)
% [5 6 . . . .]       [5 7 . . . .]       [5 8 . . . .]       [5 9 . . . .]
%      ^ 7:100             ^ 8:100             ^ 9:100             ^ 10:100

And so on, until you build up all of [5 10 11 40 50 .], which contains the last term of nchoosek(100-50,6-5). Then all that is left is to count the elements from [5 10 11 40 51 52] to [5 10 11 40 51 67], which is 67-52+1.

So, formalizing it, if your index vector B has n elements, each called b_k for k=1:n:

sum_{k=1:n} sum_{b=b_{k-1}+1:b_{k}-1} nchoosek(100-b,n-k)

which will work if we define b_0 to be zero for the first index, and note that nchoosek(m,0)==1.

So here's a straightforward function that should work. It's not optimal in its kind, but it sure beats checking 1e9 vectors against each other:

function ind = find_chooseind(N,B)
% N is the N in 1:N in nchoosek(1:N,K)
% length(B) == K

% define auxiliary B with a zero prepended to avoid out-of-bounds errors
Blen = length(B);
B = [0; B(:)];

ind = 0;
for k=2:Blen+1 % shifted for that first zero
   for b=B(k-1)+1:B(k)-1
      ind = ind + nchoosek(N-b,Blen-(k-1)); % compensate for k shift
   end
end

% testing reveals an off-by-one error, not to worry
ind = ind+1;

I tested the above code with a smaller example:

>> N = 20; K = 4;
>> A20_4 = nchoosek(1:N,K);
>> t = A20_4(randi(nchoosek(N,K)),:);
>> ind = find_chooseind(N,t);
>> A20_4(ind,:)

ans =

     7     8     9    14

>> t

t =

     7     8     9    14

Upvotes: 3

Related Questions