Reputation: 4189
I am in my first bout with Stata. I have never used it until this week and am trying to work through some examples. I have the following set of data:
contruse | educ_none | educ_prim | educ_secabove
1 | 0 | 1 | 0
0 | 1 | 0 | 0
...
I created the following variable, with the corresponding data set shown below, so that I could tab contruse against all the different education levels.
gen education=0
replace education=1 if educ_none==1
replace education=2 if educ_prim==1
replace education=3 if educ_secabove==1
replace education=. if educ_none==. | educ_prim==. | educ_secabove==.
tab education, missing
contruse | educ_none | educ_prim | educ_secabove | education
1 | 0 | 1 | 0 | 2
0 | 1 | 0 | 0 | 1
Basically, is there a better way of doing this? For instance, my varlist could be arbitrarily large, and doing the above by hand is painful. Is there a way of, say, reversing the following loop so that it works through multiple variables and gives a single variable its value?
foreach x of varlist educ_none educ_prim educ_secabove {
replace `x' = . if var > 3
}
Upvotes: 2
Views: 8285
Reputation: 908
Automated approach 2014-06-02
After stating that the process of creating and labeling new variables can't be automated, I decided to try. I found two commands on SSC that help: Roger Newson's varlabdef and Daniel Klein's labvalch3. Both can be downloaded from within Stata, e.g. ssc install varlabdef.
I assume, as in the original example, that each 0-1 variable name is of the form "root_suffix", and that exactly one of the variables with the same root has value 1. The goal is to create a new variable for each root with a value that corresponds to the order of the indicator variable (if any) with value 1. The user first creates a local macro that contains all the roots. The program loops through the roots, with one variable created in each pass; an inner loop implements Nick's solution (B); varlabdef creates value labels from the names of the original indicators; and labvalch3 strips off all but the suffix and capitalizes each item. This value label is then assigned to the new variable with a label values statement. Outside the loop, the new variables are given variable labels with label variable.
In the example that follows, there are two "roots", educ and gender. The variables with root "gender", for example, are gender_male and gender_female. A new variable gender is initialized, then assigned values 1 for males and 2 for females. A corresponding value label (also named "gender") is defined and associated with the new variable, and the variable itself is labeled "Gender".
clear
input id educ_none educ_prim educ_secabove gender_male gender_female
1 0 1 0 1 0
2 1 0 0 1 0
3 0 0 1 0 1
4 0 1 0 1 0
end
/* Create local macro to hold root names */
local roots educ gender
/* Loop over each root */
foreach v of local roots {
qui gen `v' = 0 /* Initialize new variable from root */
/* Get number of component variables */
qui ds `v'_*
local wc : word count `r(varlist)'
/* Create new variables */
forvalues k = 1/`wc' {
/* z`k' is the k-th component variable */
local z`k' : word `k' of `r(varlist)' /* extended macro */
qui replace `v' = `v'+`k'*`z`k''
}
/* Total components to check for missing/illegal values*/
egen `v'tot = rowtotal(`v'_*)
replace `v' = . if `v'tot != 1
replace `v' = .a if `v'tot>1 & `v'tot<.
/* Create value labels from variable names. Note that
value labels can have the same names as the
variables they label */
/* Create a value label consisting of the component variable names */
varlabdef `v', vlist(`v'_*) from(name)
label define `v' .a "Illegal", add
/* Remove the roots from the labels and capitalize */
labvalch3 `v', subst("`v'_" "")
labvalch3 `v', strfcn(proper("@"))
/* Assign the value labels to the new variables */
label values `v' `v'
}
/* Give nice labels to the new variables */
label var educ "Education"
label var gender "Gender"
label list
tab educ
tab gender
The results are:
. label list
gender:
1 Male
2 Female
.a Illegal
educ:
1 None
2 Prim
3 Secabove
.a Illegal
. tab educ
Education | Freq. Percent Cum.
------------+-----------------------------------
None | 1 25.00 25.00
Prim | 2 50.00 75.00
Secabove | 1 25.00 100.00
------------+-----------------------------------
Total | 4 100.00
. tab gender
Gender | Freq. Percent Cum.
------------+-----------------------------------
Male | 3 75.00 75.00
Female | 1 25.00 100.00
------------+-----------------------------------
Total | 4 100.00
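If installing the SSC commands is not an option, the labeling step can be approximated with built-in functions only. The following is a minimal sketch under the same naming assumption ("root_suffix"), using a single hypothetical root and toy data; it is not a replacement for varlabdef and labvalch3, and it does no checking for missing or illegal combinations.

```stata
* Sketch only: the toy data and the macro names root, k and suffix are illustrative
clear
input id gender_male gender_female
1 1 0
2 1 0
3 0 1
end

local root gender
gen `root' = 0
local k = 0
foreach x of varlist `root'_* {
    local ++k                                   // position of `x' among the indicators
    quietly replace `root' = `root' + `k' * `x'
    * Strip the "gender_" prefix and capitalize what is left
    local suffix = proper(subinstr("`x'", "`root'_", "", 1))
    label define `root' `k' "`suffix'", add
}
label values `root' `root'
tab `root'
```

The varlist is expanded before gender is created, and gender itself does not match gender_*, so the loop only visits the indicators.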
Upvotes: 4
Reputation: 908
Can you automate this process? The answer is "No", because each component variable will have a unique suffix. So if you have "race_black", "race_hisp_nonw" and "race_white", for example, you can't process the "education" and "race" variables in the same way. You will also have unique value labels to assign to each variable. (See my second answer, "Automated approach", for an attempt anyway.)
Two other issues:
Reading your example, it seems that for education there are exactly three categories. So you are initializing to a non-existent category.
Your treatment of the missings is possibly incorrect. You've set education to missing if any of its components is missing. It's possible that an interviewer correctly coded one of the component variables as "1" and left the other values blank (missing) when they should have been coded "0". Education for that observation should not be set to missing.
Here's my idea of code:
set linesize 100
clear
input id educ_none educ_prim educ_secabove
1 0 1 0
2 1 0 0
3 0 0 1
4 . 1 . /* Okay */
5 . . . /* Really Missing */
6 0 0 0 /* Really Missing */
7 . 1 1 /* Illegal */
end
egen etot = rowtotal(educ_*) /* = 1 for valid values */
foreach x of varlist educ_* {
/* Tentatively fix incorrect missings */
replace `x'= 0 if `x'==. & etot==1
}
list
gen education = 1 if educ_none==1
replace education=2 if educ_prim==1
replace education=3 if educ_secabove==1
/* Assign extended missing for illegal values*/
replace education = .a if etot >1 & etot<.
#delim ;
label define educl
1 "None"
2 "Primary"
3 "Secondary+"
.a ">1 indicator is 1"
;
#delim cr
label values education educl
list
tab education, missing
Upvotes: 4
Reputation: 37318
In addition to Steve Samuels' excellent suggestions, three standard devices in this territory are
A. Using recode. Check out its help.
B.
gen education = educ_none + 2 * educ_prim + 3 * educ_secabove
(which works if and only if at most one indicator is 1)
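Device B extends to an arbitrarily long varlist by accumulating the sum in a loop. This is a minimal sketch on toy data, not a definitive implementation; the counter k and the order of the varlist (which decides the codes) are assumptions made for illustration.

```stata
* Sketch only: toy data; k is a hypothetical counter for the indicator's position
clear
input educ_none educ_prim educ_secabove
0 1 0
1 0 0
0 0 1
end

gen education = 0
local k = 0
foreach x of varlist educ_none educ_prim educ_secabove {
    local ++k                                   // 1 for educ_none, 2 for educ_prim, ...
    quietly replace education = education + `k' * `x'
}
list
```

As with B itself, the result is only meaningful when at most one indicator is 1, and a missing value in any indicator makes education missing for that observation.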
C.
gen education = cond(educ_secabove == 1, 3, ///
cond(educ_prim == 1, 2, ///
cond(educ_none == 1, 1, .)))
Notes:
C1. The code just above is one statement; in a do-file, the /// at the end of a line continues it onto the next. The layout is just to help make the structure visible.
C2. Just as in elementary algebra, each left parenthesis ( implies a promise to match it by a right parenthesis ). Nesting calls to cond() doesn't change that.
C3. There is more on cond() at http://www.stata-journal.com/sjpdf.html?articlenum=pr0016
Upvotes: 4