Reputation: 289
This is a problem that I have never encountered before, hence, I don't even know where to start.
I have an unbalanced panel data set (different products sold at different stores across weeks) and would like to run correlations on sales between each product combination. The requirement is, however, a correlation is only to be calculated using the sales values of two products appearing together in the same store and week. That is to say, some weeks or some stores may sell only either of the two given products, so we just want to disregard those instances.
The number of observations in my data set is 400,000 but among them I have only 50 products sold, so the final correlation matrix would be 50*50=2500 with 1250 unique correlation values. Does it makes sense?
clear
input str2 product sales store week
A 10 1 1
B 20 1 1
C 23 1 1
A 10 2 1
B 30 2 1
C 30 2 1
F 43 2 1
end
The correlation table should be something like this [fyi, instead of the correlation values I put square brackets to illustrate the values to be used]. Please note that I cannot run a correlation for AF because there is only one store/week combination.
A B C
A 1 [10,20; 10,30] [10,23; 10,30]
B 1 [20,23; 30,30]
C 1
Upvotes: 0
Views: 230
Reputation: 37278
You calculate correlations between pairs of variables; but what you regard as pairs of variables are not so in the present data layout. So, you need a reshape
. The principle is shown by
clear
input str2 product sales store week
A 10 1 1
B 20 1 1
C 23 1 1
A 10 2 1
B 30 2 1
C 30 2 1
F 43 2 1
end
reshape wide sales , i(store week) j(product) string
rename sales* *
list
+----------------------------------+
| store week A B C F |
|----------------------------------|
1. | 1 1 10 20 23 . |
2. | 2 1 10 30 30 43 |
+----------------------------------+
pwcorr A-F
| A B C F
-------------+------------------------------------
A | .
B | . 1.0000
C | . 1.0000 1.0000
F | . . . .
The results look odd only because your toy example won't allow otherwise. So A doesn't vary in your example and the correlation isn't defined. The correlation between B and C is perfect because there are two data points different in both B and C.
A different problem is that a 50 x 50 correlation matrix is unwieldy. How to get friendlier output depends on what you want to use it for.
Upvotes: 2