Reputation: 6569
Let us have a vector of size N
. For example:
x = rand(N,1)
I want to compute the minimum standard deviation of a subset of length K
in the vector.
When N
and K
are small, it is easy to find the best subset since I can use nchoosek(N,K)
to enumerate all possible subsets. But when the values of N
and K
are larger than let's say N=50
and K=25
, nchoosek
fails to compute the combinations since the size of the possible subsets are huge.
I wonder if there is a better algorithm to compute the subset that gives the smallest standard deviation in an array efficiently. For example via dynamic programming. Any ideas?
Update:
I've implemented it in a loop aftter the answers and compared to the combinatorial solution. The results are always the same but the speed gain are unprecedented.
n = 20;
k = 10;
x = rand(n,1);
C = nchoosek(x, k);
tic
mins = realmax;
for i = 1:size(C,1)
s = std(C(i,:));
if s < mins
mins = s;
bestC = C(i,:);
end
end
toc
tic
[x2, j] = sort(x);
mins2 = realmax;
for i = 1:(n-k+1)
s = std(x2(i:(i+k-1)));
if s < mins2
mins2 = s;
idx = j((i:(i+k-1)));
end
end
toc
if mins == mins2
'Equal'
end
gives
Elapsed time is 7.786579 seconds.
Elapsed time is 0.002068 seconds.
ans =
Equal
Upvotes: 4
Views: 1727
Reputation: 32521
Sort the array and then compute this in one pass with rolling window of length K
.
I'm sure this will give you the right answer, will think if I can prove it.
Hand-wavey argument, with a possible gap in logic in the "extend this" part:
Consider an element x
from your list. Let's try and find the minimum standard deviation of a set of size 2 containing this element. We'll get this by choosing x
and the closest element to x
. Extend this to k
elements and we'll get a set that is a contiguous part of the sorted list that contains x
. To pick the smallest subset of k
elements (i.e for any x
) we therefore just have to iterate over the sorted list as described before.
Upvotes: 3