bill999
bill999

Reputation: 2550

How to Simply Create New Variable Based on Ranges of Another

Say I have var1 that is continuous:

clear
set obs 1000
gen var1 = runiform()
sum var1

Now I want to create var2 based on ranges of var1. I can do this as follows:

gen var2 = "Lowest" if var1<.25
replace var2 = "Low" if var1>=.25 & var1<.5
replace var2 = "High" if var1>=.5 & var1<.75
replace var2 = "Highest" if var1>=.75

I would like to be able to do this in one line. Pseudocode:

gen var2 = (ranges(0 .25 .5 .75 1) values("Lowest" "Low" "High" "Highest"))

A way to do something quite similar in R using cut is found at Create categorical variable in R based on range

Is there any command that can do something in Stata that is like the R version? Imagine that one has 10,000 ranges that are needing to go into var2. Then a better method would help a lot.

Another way to do this on one line in Stata is clunky and is found at http://www.stata.com/support/faqs/data-management/multiple-operations/:

generate var2 = cond(var1<=.25, "Lowest", cond(var1<=.50, "Low", cond(var1<=.75, "High", cond(var1<=1.00, "Highest", ""))))

Is there a better way?

Upvotes: 0

Views: 7237

Answers (2)

Nick Cox
Nick Cox

Reputation: 37338

The cond() function is the supposedly clunky function alluded to. See var3 below for an example. It has the signal advantages that you can make the inequalities explicit in your code and exactly as you wish, neither of which is true of egen, cut().

In this particular example, at least one further trick is possible. See var4 below for what it is.

. clear

. set obs 15
number of observations (_N) was 0, now 15

. set seed 2803 

. gen var1 = runiform()

. sort var1 

. gen var2 = "Lowest" if var1<.25
(9 missing values generated)

. replace var2 = "Low" if var1>=.25 & var1<.5
(4 real changes made)

. replace var2 = "High" if var1>=.5 & var1<.75
(2 real changes made)

. replace var2 = "Highest" if var1>=.75
variable var2 was str6 now str7
(3 real changes made)

. gen var3 = cond(var1 < .25, "Lowest", cond(var1 <.5, "Low", cond(var1 <.75, "
> High", "Highest"))) 

. gen var4 = word("Lowest Low High Highest", ceil(4 * var1)) 

. list 

     +----------------------------------------+
     |     var1      var2      var3      var4 |
     |----------------------------------------|
  1. | .0200225    Lowest    Lowest    Lowest |
  2. | .0360774    Lowest    Lowest    Lowest |
  3. | .0934085    Lowest    Lowest    Lowest |
  4. | .0950848    Lowest    Lowest    Lowest |
  5. | .1040797    Lowest    Lowest    Lowest |
     |----------------------------------------|
  6. | .1795591    Lowest    Lowest    Lowest |
  7. | .3326341       Low       Low       Low |
  8. | .3383934       Low       Low       Low |
  9. | .3870576       Low       Low       Low |
 10. | .3980427       Low       Low       Low |
     |----------------------------------------|
 11. | .6264514      High      High      High |
 12. | .6305373      High      High      High |
 13. | .7739685   Highest   Highest   Highest |
 14. | .7935746   Highest   Highest   Highest |
 15. | .9243789   Highest   Highest   Highest |
     +----------------------------------------+

However, if you really have 10,000 ranges to specify, and they don't boil down to some simple rule, then you naturally wouldn't do it either of these ways. You should put them in a file and use some code based on a merge.

Upvotes: 4

Brendan
Brendan

Reputation: 4011

Stata does have a cut function, as part of the egen command. Using its options and defining and assigning a value label gets you the desired result (albeit in three lines rather than one, but they are three fairly concise lines). E.g.,

clear
set obs 15
gen var1 = runiform()
sum var1

gen var2 = "Lowest" if var1<.25
replace var2 = "Low" if var1>=.25 & var1<.5
replace var2 = "High" if var1>=.5 & var1<.75
replace var2 = "Highest" if var1>=.75

// =======================================================
// Using egen , cut()
// =======================================================
label define rank 0 "Lowest" 1 "Low" 2 "High" 3 "Highest"
egen var3 = cut(var1) , at(0(.25)1) icodes
label values var3 rank

li

with the result

     +------------------------------+
     |     var1      var2      var3 |
     |------------------------------|
  1. | .6658295      High      High |
  2. | .3690664       Low       Low |
  3. | .5983131      High      High |
  4. | .2658775       Low       Low |
  5. | .1211114    Lowest    Lowest |
     |------------------------------|
  6. | .2296222    Lowest    Lowest |
  7. | .7229139      High      High |
  8. | .2501513       Low       Low |
  9. | .7775574   Highest   Highest |
 10. | .2839603       Low       Low |
     |------------------------------|
 11. | .8396428   Highest   Highest |
 12. | .4838379       Low       Low |
 13. | .2610629       Low       Low |
 14. | .3855471       Low       Low |
 15. | .3447088       Low       Low |
     +------------------------------+

Upvotes: 2

Related Questions