Olga
Olga

Reputation: 289

Generating a correlation table between observations in Stata

This is a problem that I have never encountered before, hence, I don't even know where to start.

I have an unbalanced panel data set (different products sold at different stores across weeks) and would like to run correlations on sales between each product combination. The requirement is, however, a correlation is only to be calculated using the sales values of two products appearing together in the same store and week. That is to say, some weeks or some stores may sell only either of the two given products, so we just want to disregard those instances.

The number of observations in my data set is 400,000 but among them I have only 50 products sold, so the final correlation matrix would be 50*50=2500 with 1250 unique correlation values. Does it makes sense?

clear
input str2 product  sales    store week
A 10 1 1 
B 20 1 1   
C 23 1 1 
A 10 2 1 
B 30 2 1 
C 30 2 1 
F 43 2 1 
end

The correlation table should be something like this [fyi, instead of the correlation values I put square brackets to illustrate the values to be used]. Please note that I cannot run a correlation for AF because there is only one store/week combination.

     A          B               C
A    1   [10,20; 10,30]   [10,23; 10,30]
B               1         [20,23; 30,30]
C                               1

Upvotes: 0

Views: 230

Answers (1)

Nick Cox
Nick Cox

Reputation: 37278

You calculate correlations between pairs of variables; but what you regard as pairs of variables are not so in the present data layout. So, you need a reshape. The principle is shown by

clear
input str2 product  sales    store week
A 10 1 1 
B 20 1 1   
C 23 1 1 
A 10 2 1 
B 30 2 1 
C 30 2 1 
F 43 2 1 
end

reshape wide sales , i(store week) j(product) string 

rename sales* *

list

     +----------------------------------+
     | store   week    A    B    C    F |
     |----------------------------------|
  1. |     1      1   10   20   23    . |
  2. |     2      1   10   30   30   43 |
     +----------------------------------+

pwcorr A-F

             |        A        B        C        F
-------------+------------------------------------
           A |        . 
           B |        .   1.0000 
           C |        .   1.0000   1.0000 
           F |        .        .        .        . 

The results look odd only because your toy example won't allow otherwise. So A doesn't vary in your example and the correlation isn't defined. The correlation between B and C is perfect because there are two data points different in both B and C.

A different problem is that a 50 x 50 correlation matrix is unwieldy. How to get friendlier output depends on what you want to use it for.

Upvotes: 2

Related Questions