Reputation: 2550
Say I have var1
that is continuous:
clear
set obs 1000
gen var1 = runiform()
sum var1
Now I want to create var2
based on ranges of var1
. I can do this as follows:
gen var2 = "Lowest" if var1<.25
replace var2 = "Low" if var1>=.25 & var1<.5
replace var2 = "High" if var1>=.5 & var1<.75
replace var2 = "Highest" if var1>=.75
I would like to be able to do this in one line. Pseudocode:
gen var2 = (ranges(0 .25 .5 .75 1) values("Lowest" "Low" "High" "Highest"))
A way to do something quite similar in R
using cut
is found at Create categorical variable in R based on range
Is there any command that can do something in Stata that is like the R version? Imagine that one has 10,000 ranges that are needing to go into var2
. Then a better method would help a lot.
Another way to do this on one line in Stata is clunky and is found at http://www.stata.com/support/faqs/data-management/multiple-operations/:
generate var2 = cond(var1<=.25, "Lowest", cond(var1<=.50, "Low", cond(var1<=.75, "High", cond(var1<=1.00, "Highest", ""))))
Is there a better way?
Upvotes: 0
Views: 7237
Reputation: 37338
The cond()
function is the supposedly clunky function alluded to. See var3
below for an example. It has the signal advantages that you can make the inequalities explicit in your code and exactly as you wish, neither of which is true of egen, cut()
.
In this particular example, at least one further trick is possible. See var4
below for what it is.
. clear
. set obs 15
number of observations (_N) was 0, now 15
. set seed 2803
. gen var1 = runiform()
. sort var1
. gen var2 = "Lowest" if var1<.25
(9 missing values generated)
. replace var2 = "Low" if var1>=.25 & var1<.5
(4 real changes made)
. replace var2 = "High" if var1>=.5 & var1<.75
(2 real changes made)
. replace var2 = "Highest" if var1>=.75
variable var2 was str6 now str7
(3 real changes made)
. gen var3 = cond(var1 < .25, "Lowest", cond(var1 <.5, "Low", cond(var1 <.75, "
> High", "Highest")))
. gen var4 = word("Lowest Low High Highest", ceil(4 * var1))
. list
+----------------------------------------+
| var1 var2 var3 var4 |
|----------------------------------------|
1. | .0200225 Lowest Lowest Lowest |
2. | .0360774 Lowest Lowest Lowest |
3. | .0934085 Lowest Lowest Lowest |
4. | .0950848 Lowest Lowest Lowest |
5. | .1040797 Lowest Lowest Lowest |
|----------------------------------------|
6. | .1795591 Lowest Lowest Lowest |
7. | .3326341 Low Low Low |
8. | .3383934 Low Low Low |
9. | .3870576 Low Low Low |
10. | .3980427 Low Low Low |
|----------------------------------------|
11. | .6264514 High High High |
12. | .6305373 High High High |
13. | .7739685 Highest Highest Highest |
14. | .7935746 Highest Highest Highest |
15. | .9243789 Highest Highest Highest |
+----------------------------------------+
However, if you really have 10,000 ranges to specify, and they don't boil down to some simple rule, then you naturally wouldn't do it either of these ways. You should put them in a file and use some code based on a merge
.
Upvotes: 4
Reputation: 4011
Stata does have a cut
function, as part of the egen
command. Using its options and defining and assigning a value label gets you the desired result (albeit in three lines rather than one, but they are three fairly concise lines). E.g.,
clear
set obs 15
gen var1 = runiform()
sum var1
gen var2 = "Lowest" if var1<.25
replace var2 = "Low" if var1>=.25 & var1<.5
replace var2 = "High" if var1>=.5 & var1<.75
replace var2 = "Highest" if var1>=.75
// =======================================================
// Using egen , cut()
// =======================================================
label define rank 0 "Lowest" 1 "Low" 2 "High" 3 "Highest"
egen var3 = cut(var1) , at(0(.25)1) icodes
label values var3 rank
li
with the result
+------------------------------+
| var1 var2 var3 |
|------------------------------|
1. | .6658295 High High |
2. | .3690664 Low Low |
3. | .5983131 High High |
4. | .2658775 Low Low |
5. | .1211114 Lowest Lowest |
|------------------------------|
6. | .2296222 Lowest Lowest |
7. | .7229139 High High |
8. | .2501513 Low Low |
9. | .7775574 Highest Highest |
10. | .2839603 Low Low |
|------------------------------|
11. | .8396428 Highest Highest |
12. | .4838379 Low Low |
13. | .2610629 Low Low |
14. | .3855471 Low Low |
15. | .3447088 Low Low |
+------------------------------+
Upvotes: 2