coinbase_wells
coinbase_wells

Reputation: 57

Impute missing categories

I have randomly missing categories in a Stata dataset that look like the following

omb_control_number     agency     hours
1                      HHS-ACF    
1                                 10
2                      
2
2                      HHS-CDC    2
3                      
3                      HHS-ACF    3
3
4                      HHS-ACF    10
4
4
4

The omb_control_number variable is constant throughout the data is not missing. I am trying to impute the categories such that all unique omb_control_number have the same agency and hours. I tried using the following:

by omb_control_number, sort : replace agency[_n-1] if missing(agency)

But it filled in only previous values. Is there a way to do this where it won't just fill in previous values? For reference, the final dataset should look like the following:

omb_control_number     agency     hours
1                      HHS-ACF    10 
1                      HHS-ACF    10
2                      HHS-CDC    2
2                      HHS-CDC    2
2                      HHS-CDC    2
3                      HHS-ACF    3
3                      HHS-ACF    3
3                      HHS-ACF    3
4                      HHS-ACF    10
4                      HHS-ACF    10
4                      HHS-ACF    10
4                      HHS-ACF    10

Upvotes: 0

Views: 46

Answers (3)

24thDan
24thDan

Reputation: 123

This will get you the desired results:

bysort omb_control_number: gen nonmissing = sum(!missing(agency)) if !missing(agency)
bysort omb_control_number: gen nonmissing2 = sum(!missing(hours)) if !missing(hours)
bysort omb_control_number (nonmissing) : replace agency = agency[1]
bysort omb_control_number (nonmissing2) : replace hours = hours[1]
drop nonmissing*

Upvotes: 0

Nick Cox
Nick Cox

Reputation: 37183

If agency is a string variable, then

bysort omb (agency) : replace agency = agency[_N] 

will copy the last value after sorting to all observations for the same group.

If agency is a numeric variable with value labels, keep reading.

As hours is presumably a numeric variable, it is the same idea with a twist:

bysort omb (hours) : replace hours = hours[1] 

In neither case is there any check for two or more non-missing values for the same identifier.

For a numeric variable, whether with or without value labels, a check would be

bysort omb (hours) : gen byte OK = (hours == hours[1]) |  missing(hours) 

You should then want to look if any observations are 0 on OK. 1 means "OK".

And from the above string variables can be checked too, with a need to look in the last observation -- indexed by _N-- rather than the first -- indexed by 1.

Upvotes: 1

TheIceBear
TheIceBear

Reputation: 3255

If you do not care about maintaining original sort order, then you can do this:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte omb_control_number str7 agency byte hours
1 "HHS-ACF"  .
1 ""        10
2 ""         .
2 ""         .
2 "HHS-CDC"  2
3 ""         .
3 "HHS-ACF"  3
3 ""         .
4 "HHS-ACF" 10
4 ""         .
4 ""         .
4 ""         .
end

gsort omb_control_number -agency
bys  omb_control_number : replace agency = agency[_n-1] if missing(agency)

sort omb_control_number hours
bys  omb_control_number : replace hours = hours[_n-1] if missing(hours)

Upvotes: 1

Related Questions