Michelle
Michelle

Reputation: 55

Stata - assign different variables depending on the value within a variable

Sorry that title is confusing. Hopefully it's clear below.

I'm using Stata and I'd like to assign the value 1 to a variable that depends on the value within a different variable. I have 20 order variables and also 20 corresponding variables. For example if order1 = 3, I'd like to assign variable3 = 1. Below is a snippet of what the final dataset would look like if I had only 3 of each variable.

enter image description here
Right now I'm doing this with two loops but I have to another loop around this that goes through this 9 more times plus I'd doing this for a couple hundred data files. I'd like to make it more efficient.

  forvalues i = 1/20 {
   forvalues j = 1/20 {
     replace variable`j' = 1 if order`i'==`j'
   }
  }

Is it possible to use the value of order'i' to assign the variable[order`i'VALUE] directly? Then I can get rid of the j loop above. Something like this.

  forvalues i = 1/20 {
     replace variable[`order`i'value] = 1 

  }

Thanks for your help!

***** CLARIFICATION ADDED Feb 2nd.** I simplified my problem and the dataset too much bc the solutions suggested work for what I presented but, are not getting at what I'm really attempting to do. Thank you three for your solutions though. I was not clear enough in my post.

To clarify, my data doesn't have a one to one correspondence of each order# assigning variable# a 1 if it's not missing. For example, the first observation for order1=3, variable1 isn't supposed to get a 1, variable3 should get a 1. What I didn't include in my original post is that I'm actually checking for other conditions to set it equal to 1.

For more background, I'm counting up births of women by birth order(1st child, 2nd child, etc) that occurred at different ages of mothers. So in the data, each row is a woman, each order# is the number birth (order1=3, it's her third child). The corresponding variable#s are the counts (variable# means the woman has a child of birth order #). I mentioned in the post, that I do this 9 times bc I'm doing it for 5 year age groups (15-19; 20-24; etc). So the first set of variable# would be counts of birth by order when women were ages 15-19; the second set of variable# would be counts of births by order when women were 20-24. etc etc. After this, I sum up the counts in different ways (by woman's education, geography, etc).

So with the additional loop what I do is something more like this

forvalues k = 1/9{
  forvalues i = 1/20 {
   forvalues j = 1/20 {
     replace variable`k'_`j' = 1 if order`i'==`j' & age`i'==`k' & birth_age`i'<36
   }
  }
}

Not sure if it's possible, but I wanted to simplify so I only need to cycle through each child once, without cycling through the birth orders and directly use the value in order# to assign a 1 to the correct variable. So if order1=3 and the woman had the child at the specific age group, assign variable[agegroup][3]=1; if order1=2, then variable[agegroup][2] should get a 1.

 forvalues k=1/9{
  forvalues i = 1/20 {
     replace variable`k'_[`order`i'value] = 1 if age`i'==`k' & birth_age`i'<36
  }
 }

Upvotes: 1

Views: 2923

Answers (3)

Roberto Ferrer
Roberto Ferrer

Reputation: 11102

Using a simple forvalues loop with generate and missing() is orders of magnitude faster than other proposed solutions (until now). For this problem you need only one loop to traverse the complete list of variables, not two, as in the original post. Below some code that shows both points.

*----------------- generate some data ----------------------

clear all
set more off

local numobs 60

set obs `numobs'

quietly {
forvalues i = 1/`numobs' {
    generate order`i' = .
        local k = (`numobs' - `i' + 1)
        forvalues j = 1/`k' {
            replace order`i' = (`k' - `j' + 1) if (_n == `j')
        }
}
}

timer clear

*------------- method 1 (gen + missing()) ------------------

timer on 1

quietly {
    forvalues i = 1/`numobs' {
        generate variable`i' = !missing(order`i')
    }
}

timer off 1

* ----------- method 2 (reshape + missing()) ---------------

drop variable*

timer on 2

quietly {
    generate id = _n
    reshape long order, i(id)
    generate variable = !missing(order)
    reshape wide order variable, i(id) j(_j)
}

timer off 2

*--------------- method 3 (egen, rowmax()) -----------------

drop variable*

timer on 3

quietly {
// loop over the order variables creating dummies
forvalues v=1/`numobs' {
    tab order`v', gen(var`v'_)
}


// loop over the domain of the order variables 
// (may need to change)
forvalues l=1/`numobs' { 
    egen variable`l' = rmax(var*_`l')
    drop var*_`l'
}
}

timer off 3

*----------------- method 4 (original post) ----------------

drop variable*

timer on 4

quietly {
forvalues i = 1/`numobs' {
    gen variable`i' = 0

    forvalues j = 1/`numobs' {
        replace variable`i' = 1 if order`i'==`j'
    }
}
}

timer off 4

*-----------------------------------------------------------

timer list

The timed procedures give

. timer list
   1:      0.00 /        1 =       0.0010
   2:      0.30 /        1 =       0.3000
   3:      0.34 /        1 =       0.3390
   4:      0.07 /        1 =       0.0700

where timer 1 is the simple gen, timer 2 the reshape, timer 3 the egen, rowmax(), and timer 4 the original post.

The reason you need only one loop is that Stata's approach is to execute the command for all observations in the database, from top (first observation) to bottom (last observation). For example, variable1 is generated but according to whether order1 is missing or not; this is done for all observations of both variables, without an explicit loop.

I wonder if you actually need to do this. For future questions, if you have a further goal in mind, I think a good strategy is to mention it in your post.

Note: I've reused code from other posters' answers.

Upvotes: 2

dimitriy
dimitriy

Reputation: 9460

Here's a simpler way to do it (that still requires 2 loops):

// loop over the order variables creating dummies
forvalues v=1/20 {
    tab order`v', gen(var`v'_)
}


// loop over the domain of the order variables (may need to change)
forvalues l=1/3 { 
    egen variable`l' = rmax(var*_`l')
    drop var*_`l'
}

Upvotes: 1

Richard Herron
Richard Herron

Reputation: 10092

I would reshape twice. First reshape to long, then condition variable on !missing(order), then reshape back to wide.

* generate your data
clear
set obs 3
forvalues i = 1/3 {
    generate order`i' = .
        local k = (3 - `i' + 1)
        forvalues j = 1/`k' {
            replace order`i' = (`k' - `j' + 1) if (_n == `j')
        }
}
list


*. list
*
*     +--------------------------+
*     | order1   order2   order3 |
*     |--------------------------|
*  1. |      3        2        1 |
*  2. |      2        1        . |
*  3. |      1        .        . |
*     +--------------------------+

* I would rehsape to long, then back to wide
generate id = _n
reshape long order, i(id)
generate variable = !missing(order)
reshape wide order variable, i(id) j(_j)
order order* variable*
drop id
list


*. list
*
*     +-----------------------------------------------------------+
*     | order1   order2   order3   variab~1   variab~2   variab~3 |
*     |-----------------------------------------------------------|
*  1. |      3        2        1          1          1          1 |
*  2. |      2        1        .          1          1          0 |
*  3. |      1        .        .          1          0          0 |
*     +-----------------------------------------------------------+

Upvotes: 2

Related Questions