Rho
Rho

Reputation: 21

stata - variable operations conditional to existent vars and to a list of varnames

I have this problem. My dataset has variables like:

sec20_var1 sec22_var1 sec30_var1 sec20_var2 sec22_var2 sec30_var2 sec31_var2

(~102 sectors, ~60 variables, not all of the cominations are complete or even existent)

My intention is to build an indicator that do an average of variables within sector. So it is an "aggregated sector" that contains sectors belonging to a class in a high-med-low technology fashion. I already have the definitions of what sectors should include in each category. Let's say, in high technology I should put sec20 and sec31.

The problem: the list of sectors belonging to a class and the actual sectors available for each variable doesn't match. So I'm stucked with this problem and started to do it manually. My best approach was:

set more off
foreach v in _var02  {
    ds *`v'
    di "`r(varlist)'"
    local sects`v' `r(varlist)'
        foreach s in sec26 sec28 sec37  {
        capture confirm local sects`v'
        if !_rc {
            egen oecd_medhigh_avg_`v'=rowmean(`s'`v' sec28`v' sec37`v' sec40`v' sec59`v' sec92`v' sec54`v' sec55`v' sec48`v' sec50`v' sec53`v' sec4`v' sec5`v' sec6`v')               
    else {
        di "`v' didnt existed"
            }
}
}
}

I got it work only with those variables that has all the sectors present in the totalrow (which is simpler since I dont have to store the varlist in a macro). I would like to do an average of the AVAILABLE sectors, even if they are only two per variable.

I also noticed that the macro storage could be helpful but I don't know how to put it into my code. I'm totally stucked in here.

Thanks for your help! :)

Upvotes: 1

Views: 227

Answers (2)

Rho
Rho

Reputation: 21

Thank you @SOConnell. As I said in my comment, I went to the same direction, but I'm still searching for the solution I expected (that I don't how to program it or even if it's possible).

I used this code, that goes in the same direction that the one made by @SOConnell, but I found this one more clear. The trick is the _rc==111 that catches the missing combinations of sector_X_variable and complete them, with the objective of beeing used in the second part. Everything worked. It's not elegant, but it has some practical use. :) The third part erases the missing variables created.

*COMPLETING THE LIST OF COMBINATIONS set more off foreach v in _var02 _var03 _var08 _var13 _... { foreach s in sec27 sec35 sec42 sec43 sec45 sec46 sec39 sec52 sec67 { capture confirm variable s'v' if _rc==111 { gen s'v'=. } } } *GENERATING THE INDICATOR WITH ALL POSSIBLE COMBINATIONS set more off foreach v in _var02 _var03 _var08 _var13 ... { egen oecd_high_avg_v'=rowmean(sec27v' sec35v' sec42v' sec43v' sec45v' sec46v' sec39v' sec52v' sec67v') } *DROPPING MISSING VARIABLES CREATED TO DO THE INDICATOR. set more off foreach v of varlist * { gen TEMP=. replace TEMP=1 if !missing(v') egen TEMPSUM=sum(TEMP) if TEMPSUM==0 { di " >>> Dropping empty variable:v'" drop `v' } drop TEMP TEMPSUM }

Note that I cutted the list of variables.

Upvotes: 1

SOConnell
SOConnell

Reputation: 793

I will call what you are referring to as variables as "accounts".

The workaround would be to create empty variables in the dataset for all sectorXaccount combinations. From a point where you already have your dataset loaded into memory:

forval sec = 1/102 {
    forval account = 1/60 {
    cap gen sec`sec'_var`account'=. /*this will skip over generating the secXaccount combination if it already exists in the dataset */
    }
}

Then apply the rowmean operation to the full definition of each indicator. The missings won't be calculated into your rowmean, so it will effectively be an average of available cells without you having to do the selection manually. You could then probably automate deleting the empty variables you created if you do something like:

g start=.
forval sec = 1/102 {
    forval account = 1/60 {
    cap gen sec`sec'_var`account'=.  /*this will skip over generating the secXaccount combination if it already exists in the dataset */
    }
}
g end=.

[indicator calculations go here]

drop start-end 

However, it seems like you would be creating averages that might not be comparable (some will have 2 underlying values, some 3, some 4, etc.) so you need to be careful there (but you are probably already aware of that).

Upvotes: 0

Related Questions