Reputation: 289
I would like to transform a continuous variable avepr
∈ [0,1] into the following categories:
0 ≤ x < 0.1
0.1 ≤ x < 0.2
0.2 ≤ x < 0.3
0.3 ≤ x < 0.4
0.4 ≤ x < 0.5
0.5 ≤ x < 0.6
0.6 ≤ x < 0.7
0.7 ≤ x < 0.8
0.8 ≤ x < 0.9
0.9 ≤ x ≤ 1
To achieve it, I use the following code:
egen aveprcat=cut(avepr),at(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.1)
If I put 1 instead of 1.1, observations that are equal to 1 are ignored. Is this a correct way to make sure that values of 1 are not skipped?
I also receive a message that 2175 missing values were generated. I attempt to look what kind of values these are by using
tab aveprcat if avepr<=1, m
tab aveprcat,m
Here is what I get:
tab aveprcat if avepr<=1, m
aveprcat | Freq. Percent Cum.
------------+-----------------------------------
0 | 499,111 57.69 57.69
.1 | 238,756 27.60 85.29
.2 | 85,082 9.83 95.13
.3 | 27,999 3.24 98.36
.4 | 7,554 0.87 99.24
.5 | 2,868 0.33 99.57
.6 | 757 0.09 99.66
.7 | 170 0.02 99.67
.9 | 2,813 0.33 100.00
------------+-----------------------------------
Total | 865,110 100.00
tab aveprcat,m
aveprcat | Freq. Percent Cum.
------------+-----------------------------------
0 | 499,111 57.55 57.55
.1 | 238,756 27.53 85.08
.2 | 85,082 9.81 94.89
.3 | 27,999 3.23 98.12
.4 | 7,554 0.87 98.99
.5 | 2,868 0.33 99.32
.6 | 757 0.09 99.41
.7 | 170 0.02 99.42
.9 | 2,813 0.32 99.75
. | 2,175 0.25 100.00
------------+-----------------------------------
Total | 867,285 100.00
I am confused what those missing values are given that I have no values greater than 1. And why these two tables are different from each other? My total number of observations is 867,285.
Upon further review,
tab aveprcat if avepr<=0.1
and
tab aveprcat if avepr<0.1
generate the same information:
aveprcat | Freq. Percent Cum.
------------+-----------------------------------
0 | 499,111 100.00 100.00
------------+-----------------------------------
Total | 499,111 100.00
Upvotes: 1
Views: 365
Reputation: 11102
I'll try to complement the answer given by @NickCox, only insofar as the original poster still has doubts on the origin of missings.
The most important thing to realize, I believe, is the fact that missing values in Stata are considered "very large numbers". Once this is understood, the mystery fades away.
In the case of egen, cut()
, the code will produce a missing if a value is not within the ranges specified by the cuts. This value could be a missing (we just said it is considered a very large number) or any arbitrary non-missing value. With respect to missings, the code by Nick will give the exact same result, but if out-of-range non-missings are present, this no longer holds. An example:
clear all
set more off
*----- example data -----
set obs 15
set seed 2134976
gen x = runiform()
replace x = 2.5 in 5 // this is an out-of-range non-missing
replace x = . in 10 // this is a missing
list
*----- cut -----
egen xcut = cut(x),at(0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1.1)
gen xnick = floor(10 * x)/10
sort x
list
The question of why the two tables
tabulate aveprcat if avepr <= 1, missing
tabulate aveprcat, missing
are different, again has to do with a missing being considered a very large number. tabulate aveprcat if avepr <= 1, missing
will not show missings because you are limiting to the cases in which avepr <= 1
. So any observation for avprcat
that has a corresponding missing in avpr
, is therefore excluded right from the beginning. Continuing with the example, we have:
. tabulate xcut, missing
xcut | Freq. Percent Cum.
------------+-----------------------------------
0 | 3 20.00 20.00
.1 | 2 13.33 33.33
.2 | 2 13.33 46.67
.4 | 1 6.67 53.33
.5 | 1 6.67 60.00
.6 | 3 20.00 80.00
.8 | 1 6.67 86.67
.9 | 1 6.67 93.33
. | 1 6.67 100.00
------------+-----------------------------------
Total | 15 100.00
. tabulate xcut if x <= 1, missing
xcut | Freq. Percent Cum.
------------+-----------------------------------
0 | 3 21.43 21.43
.1 | 2 14.29 35.71
.2 | 2 14.29 50.00
.4 | 1 7.14 57.14
.5 | 1 7.14 64.29
.6 | 3 21.43 85.71
.8 | 1 7.14 92.86
.9 | 1 7.14 100.00
------------+-----------------------------------
Total | 14 100.00
The second command asks to tabulate xcut
only if x <= 1
. But for all x <= 1
, egen, cut()
produces no missings for xcut
, so there are no missings to show. An unconditional tabulate
gives a different result.
Upvotes: 3
Reputation: 37208
Statistically, you are just discarding information, and it's hard to imagine why that appeals.
As far as Stata is concerned: You want to bin a variable on [0,1] rounding down in intervals of .1. I'd not use egen, cut()
for that at all, not least because it is tedious to look up and type out the syntax; and more importantly because I much prefer syntax that is explicit about what happens at boundaries. Given such a variable y
gen y2 = floor(10 * y)/10
would be my first call, although exact 1s will map to 1s, which is not quite what you want.
Your exploration of the missings doesn't tell us much beyond the fact that they exist. Try
tab avepr if missing(aveprcat)
su avepr if missing(aveprcat), detail
Upvotes: 3