MATLAB: Efficient way to calculate a covariance matrix from this data

Question

I have a data file in which there are N=428 subjects each answering the same 8 questions. It looks like this:

 question subject  score    
    1        1       42         
    2        1       12
    3        1       13
    4        1       43
    5        1       22
    6        1       43 
    7        1       54
    8        1       66
    1        2       41
    2        2       11
   ...      ...     ...

I want to calculate and store a covariance matrix reflecting the scores of each subject.

So cell (1,1) holds subject 1's variance. Cells (1,2) and (2,1) would then both hold the same value, i.e. the covariance between subject 1 and subject 2. Although the table above does not show all of subject 2's data, it looks like they would have some positive covariance with subject 1.

N choose 2 unique covariances must be calculated, which I work out to be 91378 in total.
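
That count can be checked in one line. This is only a quick sketch using base MATLAB's nchoosek; the variable names are illustrative:

```matlab
N = 428;                     % number of subjects
nPairs = nchoosek(N, 2);     % unordered pairs of distinct subjects
fprintf('%d unique pairs\n', nPairs);   % equals N*(N-1)/2
```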

How can I efficiently achieve this?

EDIT: Using the code from @GameOfThrows I was able to get a working version going using a loop:

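crowd_cov = NaN(428,428);   % preallocate the subject-by-subject matrix

for i = 1:size(allpairs,1)  % one iteration per unordered pair of subjects
    % 2x2 covariance matrix of the two subjects' 8 scores
    Z = cov(score(indexSub1(i,1):indexSub1(i,2)),score(indexSub2(i,1):indexSub2(i,2)));
    first = allpairs(i,1);
    second = allpairs(i,2);
    crowd_cov(first,first) = Z(1,1);     % variance of the first subject
    crowd_cov(second,second) = Z(2,2);   % variance of the second subject
    crowd_cov(first,second) = Z(1,2);    % covariance, stored symmetrically
    crowd_cov(second,first) = Z(2,1);
end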

I'm happy with this, although I would still welcome an explanation of how I could have coded this more efficiently.

GameOfThrows · Accepted Answer

So you want the covariance, which tells me that you have two random variables, say the scores of subject 1 and the scores of subject 2. Let's hope for now that the question column does not play a big part in this. If the number of questions is the same for each subject, though, that will hugely increase the efficiency of your program, because it allows fast indexing.

allpairs = combnk(1:max(subject),2); %// all pairs of subjects from subject 1 to subject N; the 2 means you want pairs

Note that this has no repetitions: sub 1 vs sub 2 appears only once, and sub 2 vs sub 1 does not appear at all.
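
To see the shape of that pair list on a small case, here is a sketch with 4 hypothetical subjects. It uses base MATLAB's nchoosek in place of combnk; the two return the same set of pairs, though the row order may differ:

```matlab
% Small illustration with 4 hypothetical subjects instead of 428.
smallPairs = nchoosek(1:4, 2);   % every unordered pair, one per row
disp(smallPairs)                 % 6 rows in total

% No row is the reverse of another: [1 2] is listed, [2 1] is not.
hasReversed = any(ismember(fliplr(smallPairs), smallPairs, 'rows'));
```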

Now you want to apply MATLAB's cov to each pair (you need to index into score correctly). This is where having the same number of questions saves you a lot of time. Say there are 8 questions for each subject:

indexSub1 = [(allpairs(:,1)*8 -7),(allpairs(:,1)*8)]
indexSub2 = [(allpairs(:,2)*8 -7),(allpairs(:,2)*8)]

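Now you have all the correct indexes. Each row of indexSub1 and indexSub2 holds the first and last row of one subject's block of 8 scores, so for pair i you apply cov to those two ranges:

cov(score(indexSub1(i,1):indexSub1(i,2)),score(indexSub2(i,1):indexSub2(i,2)))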
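
The indexing for a single pair can be sketched as follows. The data here are made up for illustration: only subject 1's scores and the first two of subject 2's come from the question, and nQ, x and y are names introduced for this example.

```matlab
% Made-up scores: 3 subjects x 8 questions, stacked subject by subject.
score = [42 12 13 43 22 43 54 66, ...   % subject 1 (from the question)
         41 11 15 40 25 47 50 60, ...   % subject 2 (mostly invented)
         10 60 55 12 48 20 15  9]';     % subject 3 (invented)
nQ = 8;                                 % questions per subject
allpairs  = nchoosek(1:3, 2);           % rows: [1 2; 1 3; 2 3]
indexSub1 = [allpairs(:,1)*nQ - (nQ-1), allpairs(:,1)*nQ];  % [first row, last row]
indexSub2 = [allpairs(:,2)*nQ - (nQ-1), allpairs(:,2)*nQ];

i = 1;                                      % the pair (1,2)
x = score(indexSub1(i,1):indexSub1(i,2));   % 8 scores of the first subject
y = score(indexSub2(i,1):indexSub2(i,2));   % 8 scores of the second subject
Z = cov(x, y);   % 2x2: [var(x) cov(x,y); cov(x,y) var(y)]
```

Z(1,2) is the value that belongs in cells (1,2) and (2,1) of the final matrix.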

If the number of questions is not the same for every subject, you might have to use find to index correctly, which would slow your program down a bit.
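
A related case is when every subject answered every question but the rows are not neatly ordered. One possible approach, sketched here on synthetic shuffled data, is to place each score into a question-by-subject matrix with sub2ind. The names S, scores3 and viaFind are made up for the example, and subjects with missing answers would leave NaN entries that need extra handling not shown here.

```matlab
% Hypothetical long-format columns, deliberately shuffled.
nQ = 8; N = 5;
[q, s]   = ndgrid(1:nQ, 1:N);
question = q(:); subject = s(:);
rng(0);  score = randi([0 100], nQ*N, 1);   % synthetic scores
p = randperm(nQ*N);
question = question(p); subject = subject(p); score = score(p);

% Place each score at (question, subject); unanswered slots stay NaN.
S = NaN(nQ, N);
S(sub2ind(size(S), question, subject)) = score;

% Two ways to pull out one subject's scores:
scores3 = S(:, 3);                      % ordered by question
viaFind = score(find(subject == 3));    % same values, in file order
```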

At the end you can convert your matrix to a cell array and use cellfun to apply cov, or you can use a loop for a simpler representation (am I suggesting a loop? No).
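
A minimal sketch of the cellfun route, on synthetic data and assuming 8 scores per subject stacked subject by subject. The helper offDiag and the names cells, pairCov, idxUpper and idxLower are introduced for this example only:

```matlab
% Synthetic stand-in for the real data: N subjects x nQ questions.
nQ = 8; N = 6;
rng(1); score = randi([0 100], nQ*N, 1);

cells    = mat2cell(score, repmat(nQ, N, 1), 1);   % one 8x1 cell per subject
allpairs = nchoosek(1:N, 2);
offDiag  = @(C) C(1,2);                            % pick cov(x,y) out of the 2x2 result
pairCov  = cellfun(@(x, y) offDiag(cov(x, y)), ...
                   cells(allpairs(:,1)), cells(allpairs(:,2)));

crowd_cov = diag(cellfun(@var, cells));            % variances on the diagonal
idxUpper  = sub2ind([N N], allpairs(:,1), allpairs(:,2));
idxLower  = sub2ind([N N], allpairs(:,2), allpairs(:,1));
crowd_cov(idxUpper) = pairCov;                     % fill both triangles
crowd_cov(idxLower) = pairCov;
```

cellfun with an anonymous function is not guaranteed to beat a plain loop, so it is worth timing both with tic/toc on the real data.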

EDIT:

To clarify, what I am suggesting is this: once you have indexSub1 and indexSub2, you can convert them into a 91378-by-2 cell array in which each cell holds 8 scores. That lets you use MATLAB's cellfun, which applies a function to each cell. This can increase your speed.
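
Another route, not part of the answer above and offered only as a sketch, avoids the pairwise calls altogether. When cov is given a matrix, it treats each column as a variable and each row as an observation, so reshaping score into an 8-by-428 matrix yields the whole 428-by-428 result in one call. This assumes the rows are sorted by subject and then by question, as in the sample; the data below are synthetic.

```matlab
nQ = 8; N = 428;
rng(2); score = randi([0 100], nQ*N, 1);   % synthetic stand-in for the real column

S = reshape(score, nQ, N);   % column k holds the 8 scores of subject k
crowd_cov = cov(S);          % N-by-N; entry (i,j) is the covariance of subjects i and j
```

The diagonal holds each subject's variance, and the matrix is symmetric, matching the layout described in the question.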
