Olga
Olga

Reputation: 289

Why missing values are generated during transformation of continuous variables into categorical?

I would like to transform a continuous variable avepr ∈ [0,1] into the following categories:

  0 ≤ x < 0.1
0.1 ≤ x < 0.2
0.2 ≤ x < 0.3
0.3 ≤ x < 0.4
0.4 ≤ x < 0.5
0.5 ≤ x < 0.6 
0.6 ≤ x < 0.7
0.7 ≤ x < 0.8
0.8 ≤ x < 0.9
0.9 ≤ x ≤ 1

To achieve it, I use the following code:

egen aveprcat=cut(avepr),at(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.1)

If I put 1 instead of 1.1, observations that are equal to 1 are ignored. Is this a correct way to make sure that values of 1 are not skipped?

I also receive a message that 2175 missing values were generated. I attempt to look what kind of values these are by using

tab aveprcat if avepr<=1, m
tab aveprcat,m

Here is what I get:

tab aveprcat if avepr<=1, m
 aveprcat |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |    499,111       57.69       57.69
         .1 |    238,756       27.60       85.29
         .2 |     85,082        9.83       95.13
         .3 |     27,999        3.24       98.36
         .4 |      7,554        0.87       99.24
         .5 |      2,868        0.33       99.57
         .6 |        757        0.09       99.66
         .7 |        170        0.02       99.67
         .9 |      2,813        0.33      100.00
------------+-----------------------------------
      Total |    865,110      100.00




tab aveprcat,m

   aveprcat |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |    499,111       57.55       57.55
         .1 |    238,756       27.53       85.08
         .2 |     85,082        9.81       94.89
         .3 |     27,999        3.23       98.12
         .4 |      7,554        0.87       98.99
         .5 |      2,868        0.33       99.32
         .6 |        757        0.09       99.41
         .7 |        170        0.02       99.42
         .9 |      2,813        0.32       99.75
          . |      2,175        0.25      100.00
------------+-----------------------------------
      Total |    867,285      100.00

I am confused what those missing values are given that I have no values greater than 1. And why these two tables are different from each other? My total number of observations is 867,285.

Upon further review,

tab aveprcat if avepr<=0.1

and

tab aveprcat if avepr<0.1

generate the same information:

   aveprcat |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |    499,111      100.00      100.00
------------+-----------------------------------
      Total |    499,111      100.00

Upvotes: 1

Views: 365

Answers (2)

Roberto Ferrer
Roberto Ferrer

Reputation: 11102

I'll try to complement the answer given by @NickCox, only insofar as the original poster still has doubts on the origin of missings.

The most important thing to realize, I believe, is the fact that missing values in Stata are considered "very large numbers". Once this is understood, the mystery fades away.

In the case of egen, cut(), the code will produce a missing if a value is not within the ranges specified by the cuts. This value could be a missing (we just said it is considered a very large number) or any arbitrary non-missing value. With respect to missings, the code by Nick will give the exact same result, but if out-of-range non-missings are present, this no longer holds. An example:

clear all
set more off

*----- example data -----

set obs 15
set seed 2134976

gen x = runiform()
replace x = 2.5 in 5  // this is an out-of-range non-missing
replace x = . in 10   // this is a missing

list

*----- cut -----

egen xcut = cut(x),at(0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1.1)
gen xnick = floor(10 * x)/10

sort x
list

The question of why the two tables

tabulate aveprcat if avepr <= 1, missing
tabulate aveprcat, missing

are different, again has to do with a missing being considered a very large number. tabulate aveprcat if avepr <= 1, missing will not show missings because you are limiting to the cases in which avepr <= 1. So any observation for avprcat that has a corresponding missing in avpr, is therefore excluded right from the beginning. Continuing with the example, we have:

. tabulate xcut, missing

       xcut |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          3       20.00       20.00
         .1 |          2       13.33       33.33
         .2 |          2       13.33       46.67
         .4 |          1        6.67       53.33
         .5 |          1        6.67       60.00
         .6 |          3       20.00       80.00
         .8 |          1        6.67       86.67
         .9 |          1        6.67       93.33
          . |          1        6.67      100.00
------------+-----------------------------------
      Total |         15      100.00

. tabulate xcut if x <= 1, missing

       xcut |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          3       21.43       21.43
         .1 |          2       14.29       35.71
         .2 |          2       14.29       50.00
         .4 |          1        7.14       57.14
         .5 |          1        7.14       64.29
         .6 |          3       21.43       85.71
         .8 |          1        7.14       92.86
         .9 |          1        7.14      100.00
------------+-----------------------------------
      Total |         14      100.00

The second command asks to tabulate xcut only if x <= 1. But for all x <= 1, egen, cut() produces no missings for xcut, so there are no missings to show. An unconditional tabulate gives a different result.

Upvotes: 3

Nick Cox
Nick Cox

Reputation: 37208

Statistically, you are just discarding information, and it's hard to imagine why that appeals.

As far as Stata is concerned: You want to bin a variable on [0,1] rounding down in intervals of .1. I'd not use egen, cut() for that at all, not least because it is tedious to look up and type out the syntax; and more importantly because I much prefer syntax that is explicit about what happens at boundaries. Given such a variable y

gen y2 = floor(10 * y)/10

would be my first call, although exact 1s will map to 1s, which is not quite what you want.

Your exploration of the missings doesn't tell us much beyond the fact that they exist. Try

tab avepr if missing(aveprcat)
su avepr if missing(aveprcat), detail 

Upvotes: 3

Related Questions