How to Simply Create New Variable Based on Ranges of Another

Question

Say I have var1 that is continuous:

clear
set obs 1000
gen var1 = runiform()
sum var1

Now I want to create var2 based on ranges of var1. I can do this as follows:

gen var2 = "Lowest" if var1<.25
replace var2 = "Low" if var1>=.25 & var1<.5
replace var2 = "High" if var1>=.5 & var1<.75
replace var2 = "Highest" if var1>=.75

I would like to be able to do this in one line. Pseudocode:

gen var2 = (ranges(0 .25 .5 .75 1) values("Lowest" "Low" "High" "Highest"))

A way to do something quite similar in R, using cut, is described in the question "Create categorical variable in R based on range".

Is there a Stata command that works like the R version? Imagine having 10,000 ranges that need to go into var2; a better method would then help a lot.

Another way to do this in one line in Stata, with nested cond() calls, is described at http://www.stata.com/support/faqs/data-management/multiple-operations/, but it is clunky:

generate var2 = cond(var1<=.25, "Lowest", cond(var1<=.50, "Low", cond(var1<=.75, "High", cond(var1<=1.00, "Highest", ""))))

Is there a better way?

Nick Cox · Accepted Answer

The supposedly clunky approach alluded to is nested calls to the cond() function. See var3 below for an example. It has the signal advantages that the inequalities are explicit in your code and exactly as you wish, neither of which is true of egen, cut().

In this particular example, at least one further trick is possible. See var4 below for what it is.

. clear

. set obs 15
number of observations (_N) was 0, now 15

. set seed 2803 

. gen var1 = runiform()

. sort var1 

. gen var2 = "Lowest" if var1<.25
(9 missing values generated)

. replace var2 = "Low" if var1>=.25 & var1<.5
(4 real changes made)

. replace var2 = "High" if var1>=.5 & var1<.75
(2 real changes made)

. replace var2 = "Highest" if var1>=.75
variable var2 was str6 now str7
(3 real changes made)

. gen var3 = cond(var1 < .25, "Lowest", cond(var1 < .5, "Low", cond(var1 < .75, "High", "Highest")))

. gen var4 = word("Lowest Low High Highest", ceil(4 * var1)) 

. list 

     +----------------------------------------+
     |     var1      var2      var3      var4 |
     |----------------------------------------|
  1. | .0200225    Lowest    Lowest    Lowest |
  2. | .0360774    Lowest    Lowest    Lowest |
  3. | .0934085    Lowest    Lowest    Lowest |
  4. | .0950848    Lowest    Lowest    Lowest |
  5. | .1040797    Lowest    Lowest    Lowest |
     |----------------------------------------|
  6. | .1795591    Lowest    Lowest    Lowest |
  7. | .3326341       Low       Low       Low |
  8. | .3383934       Low       Low       Low |
  9. | .3870576       Low       Low       Low |
 10. | .3980427       Low       Low       Low |
     |----------------------------------------|
 11. | .6264514      High      High      High |
 12. | .6305373      High      High      High |
 13. | .7739685   Highest   Highest   Highest |
 14. | .7935746   Highest   Highest   Highest |
 15. | .9243789   Highest   Highest   Highest |
     +----------------------------------------+
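The var4 trick works only because the four ranges have equal width: ceil(4 * var1) turns a value in (0, 1] into an index from 1 to 4, and word() picks the label at that position. The sketch below makes the intermediate index visible; idx and the five input values are illustrative assumptions, not part of the original example. Note that the two methods treat boundaries differently: a value of exactly .25 gets index 1 ("Lowest") from ceil(), but "Low" from the cond() version with strict inequalities.

```stata
* Illustrative values only, chosen to include the boundary .25
clear
input double var1
.1
.25
.3
.6
.9
end

* Index of the equal-width range each value falls in: (0,.25] -> 1, (.25,.5] -> 2, ...
gen idx = ceil(4 * var1)

* Pick the idx-th word from the list of labels
gen var4 = word("Lowest Low High Highest", idx)

* The nested cond() version, for comparison at the boundary
gen var3 = cond(var1 < .25, "Lowest", cond(var1 < .5, "Low", cond(var1 < .75, "High", "Highest")))

list
```

If the boundary convention matters, cond() is the safer choice, since the inequalities are written out.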

However, if you really have 10,000 ranges to specify, and they don't boil down to some simple rule, then you naturally wouldn't do it in either of these ways. You should put them in a file and use some code based on a merge.
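One way a merge-based approach might look is sketched below. This is a minimal illustration under stated assumptions, not necessarily the code the answer has in mind: the lookup table, the variable names bin and category, and the use of egen, cut() with the icodes option to code the intervals are all choices made here for the example. With only four ranges the break points are typed into at() directly; with thousands of ranges they would themselves need to come from the file.

```stata
* Hypothetical lookup table: one row per range, keyed by an integer code
clear
input byte bin str7 category
0 "Lowest"
1 "Low"
2 "High"
3 "Highest"
end
tempfile lookup
save `lookup'

* Example data, as in the question
clear
set obs 1000
set seed 2803
gen var1 = runiform()

* Code each value by its interval: [0,.25) -> 0, [.25,.5) -> 1, and so on
egen bin = cut(var1), at(0 .25 .5 .75 1) icodes

* Attach the labels from the lookup table, keeping every observation
merge m:1 bin using `lookup', keep(master match) nogenerate
rename category var2
```

Keeping the labels in a dataset means that adding or renaming a range is an edit to the lookup table rather than to the code.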
