the_economist
the_economist

Reputation: 551

Stata: ignore missing values when counting

I'd like to create a variable (counter) which counts non-missing values of another variable (firmage0) in every year. My code looks like this:

sort year  
quietly by year: gen counter = _n if firmage0 != . 

Unfortunately, Stata starts counting with 1 even if there are missing values. But I'd like to start counting when there is a non-missing value. In addition the second nonmissing value should have the value "2", the third "3" and so on. There is one important restriction. Don't use preserve and restore! After creating the variable counter, my dataset looks like this (this is only a small part of it). Free space means missing value!

year    firmage0    counter
1975        
1975        
1975        
1975        
1975    21  5
1975        
1975    60  7
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975    115 20
1975        
1975        
1975

But the dataset should look like this:

year    firmage0    counter
1975        
1975        
1975        
1975        
1975    21  1
1975        
1975    60  2
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975    115 3
1975        
1975        
1975    

Upvotes: 3

Views: 7970

Answers (2)

Roberto Ferrer
Roberto Ferrer

Reputation: 11102

The system variable _n is fixed in the sense that it denotes the number of the current observation independently of your if condition. What you are asking for with your code is a counter that takes on the value of the number of the observation if it is not missing. This is different from the count of non-missing observations up until a certain point, which is what you really want.

Be aware of the data type of your missings. As you present it, firmage0 seems like a string variable because blanks are missings for strings. Missings for numeric values are represented as . (or .<something>). See help missings.

One solution is doing a cumulative sum conditional on the missings. I assume variable type is numeric:

clear all
set more off

input year firmage0
1975      .  
1975      . 
1975      .      
1975      .  
1975    21
1980    60 
1980    .
1980    89       
1985      . 
1985      .  
1985    23    
1985    45    
1985    47
1985     .   
end 

list, sepby(year)

bysort year: gen counter = sum(!missing(firmage0))
replace counter = . if missing(firmage0) // optional

list, sepby(year)

!missing(firmage0) evaluates to false (0) or true (1) depending on whether the observation for firmage0 is missing or not, respectively. Then sum does the cumulative sum of that.

If the variable is in fact of string type, then replace the corresponding line with:

replace counter = "" if missing(firmage0) // optional

A one-liner (for numeric type) is:

bysort year: gen counter = cond(missing(firmage0), ., sum(!missing(firmage0)))

but I prefer the first way because it is easier to read.

Upvotes: 3

I believe that if you sort by year and firmage0 before:

. sort year firmage0

you will have your problem solved. I changed the year of one of the firmage0 values just to confirm that it would restart the count.

. by year : gen counter = _n if firmage0 !=.

year    firmage0    counter               
1975    21  1     
1975    60  2   
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1975        
1976    115 1

Upvotes: 1

Related Questions