MCP_infiltrator

Reputation: 4189

Replace values of a variable for values of other variables Stata 13

I am in my first bout with Stata. I have never used it until this week and am trying to work through some examples. I have the following set of data:

contruse | educ_none | educ_prim | educ_secabove
1        | 0         | 1         | 0
0        | 1         | 0         | 0
...

I created the following variable (the resulting data are shown after the code) so that I could tabulate contruse against every education level.

gen education=0
replace education=1 if educ_none==1
replace education=2 if educ_prim==1
replace education=3 if educ_secabove==1
replace education=. if educ_none==. | educ_prim==. | educ_secabove==.
tab education, missing

contruse | educ_none | educ_prim | educ_secabove | education
1        | 0         | 1         | 0             | 2
0        | 1         | 0         | 0             | 1

Is there a better way of doing this? My varlist could be arbitrarily large, and doing the above by hand is painful. For instance, is there a way of reversing the following loop so that it works through multiple variables and assigns a value to a single variable?

foreach x of varlist educ_none educ_prim educ_secabove {
    replace `x' = . if var > 3
}

Upvotes: 2

Views: 8285

Answers (3)

Steve Samuels

Reputation: 908

Automated approach 2014-06-02

After stating that the process of creating and labeling new variables can't be automated, I decided to try. I found two commands on SSC that help: Roger Newson's varlabdef and Daniel Klein's labvalch3. Both can be downloaded from within Stata, e.g. ssc install varlabdef.

I assume, as in the original example, that each 0-1 variable name is of the form "root_suffix", and that exactly one of the variables with the same root has value 1. The goal is to create a new variable for each root with a value that corresponds to the order of the indicator variable (if any) with value 1. The user first creates a local macro that contains all the roots. The program loops through the roots, with one variable created in each pass; an inner loop implements Nick's solution (B); varlabdef creates value labels from the names of the original indicators; and labvalch3 strips off all but the suffix and capitalizes each item. This value label is then assigned to the new variable with a label values statement. Outside the loop, the new variables are given variable labels with label variable.

In the example that follows, there are two "roots", educ and gender. The variables with root "gender", for example, are gender_male and gender_female. A new variable gender is initialized, then assigned values 1 for males and 2 for females. A corresponding value label (also named "gender") is defined and associated with the new variable, and the variable itself is labeled "Gender".

clear
input id educ_none educ_prim educ_secabove  gender_male gender_female
1 0 1 0  1 0
2 1 0 0  1 0
3 0 0 1  0 1
4 0 1 0  1 0
end

/* Create local macro to hold root names */
local roots educ gender

/* Loop over each root */
foreach v of local roots {
   qui gen `v' = 0  /* Initialize new variable from root */

    /* Get number of component variables */
   qui ds `v'_*
   local wc : word count `r(varlist)'

   /* Create new variables */
   forvalues k = 1/`wc' {
      /* z`k' is the k-th component variable */
      local z`k' : word `k' of `r(varlist)'  /* extended macro */
      qui replace `v' = `v'+`k'*`z`k''
      }
   /* Total components to check for missing/illegal values*/
   egen `v'tot = rowtotal(`v'_*)
   replace `v' = . if `v'tot != 1
   replace `v' = .a if `v'tot>1 & `v'tot<.
   /* Create value labels from variable names. Note that
      value labels can have the same names as the
      variables they label */

   /* Create a value label consisting of the component variable names */
   varlabdef `v', vlist(`v'_*) from(name)
   label define `v' .a "Illegal", add

   /* Remove the roots from the labels and capitalize */
   labvalch3 `v', subst("`v'_" "")
   labvalch3 `v', strfcn(proper("@"))
   /* Assign the value labels to the new variables */
   label values `v' `v'
}
/* Give nice labels to the new variables */
label var educ "Education"
label var gender "Gender"

label list
tab educ
tab gender

The results are:

. label list
gender:
           1 Male
           2 Female
          .a Illegal
educ:
           1 None
           2 Prim
           3 Secabove
          .a Illegal

. tab educ

  Education |      Freq.     Percent        Cum.
------------+-----------------------------------
       None |          1       25.00       25.00
       Prim |          2       50.00       75.00
   Secabove |          1       25.00      100.00
------------+-----------------------------------
      Total |          4      100.00

. tab gender

     Gender |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |          3       75.00       75.00
     Female |          1       25.00      100.00
------------+-----------------------------------
      Total |          4      100.00
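
The inner loop above relies on Stata's extended macro functions `: word count` and `: word # of`. As a minimal, stand-alone sketch of just that mechanism (the macro names vars and name are made up for illustration and are not part of the program above), run as a single do-file:

```stata
/* Hypothetical list of component variable names, for illustration only */
local vars educ_none educ_prim educ_secabove

/* Count the words in the list */
local wc : word count `vars'
display "number of components: `wc'"

/* Pick out the k-th word on each pass */
forvalues k = 1/`wc' {
    local name : word `k' of `vars'
    display "`k' -> `name'"
}
```

No data set is needed for this; it only manipulates macros. In the program above, the same pattern is applied to the variable list that ds leaves behind in r(varlist).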

Upvotes: 4

Steve Samuels

Reputation: 908

Can you automate this process? The answer is "No", because each component variable will have a unique suffix. So if you have "race_black" "race_hisp_nonw" "race_white", for example, you can't process the "education" and "race" variables in the same way. You will also have unique value labels to assign to each variable. See my second answer ("Automated approach") for an attempt at automating it anyway.

Two other issues:

  1. Reading your example, it seems that for education there are exactly three categories. So by initializing education to 0, you are assigning a non-existent category.

  2. Your treatment of the missings is possibly incorrect. You've set education to missing if any of its components is missing. It's possible that an interviewer correctly coded one of the component variables as "1" and left the other values blank (missing) when they should have been coded "0". Education for that observation should not be set to missing.

Here's how I would code it:

set linesize 100
clear
input id educ_none educ_prim educ_secabove
1 0 1 0
2 1 0 0
3 0 0 1
4 . 1 .    /* Okay */
5 . . .    /* Really Missing */
6 0 0 0    /* Really Missing */
7 . 1 1    /* Illegal */
end

egen etot = rowtotal(educ_*) /* = 1 for valid values */
foreach x of varlist educ_* {
/* Tentatively fix incorrect missings */
    replace `x'= 0 if `x'==. & etot==1
    }
list
gen education = 1 if educ_none==1
replace education = 2 if educ_prim==1
replace education = 3 if educ_secabove==1


/* Assign extended missing for illegal values*/
replace education = .a if etot >1 & etot<.
#delim ;
label define educl
    1 "None"
    2 "Primary"
    3 "Secondary+"
    .a  ">1 indicator is 1"
 ;
#delim cr
label values education educl
list
tab education, missing

Upvotes: 4

Nick Cox

Reputation: 37318

In addition to Steve Samuels' excellent suggestions, three standard devices in this territory are

A. Using recode. Check out its help.

B.

gen education = educ_none + 2 * educ_prim + 3 * educ_secabove

(which works if and only if at most one indicator is 1)

C.

gen education = cond(educ_secabove == 1, 3, ///
                cond(educ_prim == 1, 2,     ///
                cond(educ_none == 1, 1, .)))

Notes:

C1. The code just above is one statement. The layout is just to help make the structure visible.

C2. Just as in elementary algebra, each left parenthesis ( implies a promise to match it by a right parenthesis ). Nesting calls to cond() doesn't change that.

C3. There is more on cond() at http://www.stata-journal.com/sjpdf.html?articlenum=pr0016
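
For an arbitrarily long variable list, device B can be written as a loop. What follows is only a sketch under assumptions: the toy data and the counter macro k are mine, and the same caveat applies as for B, namely that the result is only meaningful if at most one indicator is 1 in each observation.

```stata
clear
input educ_none educ_prim educ_secabove
0 1 0
1 0 0
0 0 1
end

/* Start from 0 and add k times the k-th indicator */
gen education = 0
local k = 0
foreach x of varlist educ_none educ_prim educ_secabove {
    local ++k
    /* A missing indicator makes education missing for that observation */
    replace education = education + `k' * `x'
}
list
```

The order of the variables in the varlist determines the codes, so list them in the order you want the categories numbered.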

Upvotes: 4
