Reputation: 57
I have randomly missing categories in a Stata dataset that look like the following
omb_control_number agency hours
1 HHS-ACF
1 10
2
2
2 HHS-CDC 2
3
3 HHS-ACF 3
3
4 HHS-ACF 10
4
4
4
The omb_control_number variable is constant throughout the data is not missing. I am trying to impute the categories such that all unique omb_control_number have the same agency and hours. I tried using the following:
by omb_control_number, sort : replace agency[_n-1] if missing(agency)
But it filled in only previous values. Is there a way to do this where it won't just fill in previous values? For reference, the final dataset should look like the following:
omb_control_number agency hours
1 HHS-ACF 10
1 HHS-ACF 10
2 HHS-CDC 2
2 HHS-CDC 2
2 HHS-CDC 2
3 HHS-ACF 3
3 HHS-ACF 3
3 HHS-ACF 3
4 HHS-ACF 10
4 HHS-ACF 10
4 HHS-ACF 10
4 HHS-ACF 10
Upvotes: 0
Views: 46
Reputation: 123
This will get you the desired results:
bysort omb_control_number: gen nonmissing = sum(!missing(agency)) if !missing(agency)
bysort omb_control_number: gen nonmissing2 = sum(!missing(hours)) if !missing(hours)
bysort omb_control_number (nonmissing) : replace agency = agency[1]
bysort omb_control_number (nonmissing2) : replace hours = hours[1]
drop nonmissing*
Upvotes: 0
Reputation: 37183
If agency
is a string variable, then
bysort omb (agency) : replace agency = agency[_N]
will copy the last value after sorting to all observations for the same group.
If agency
is a numeric variable with value labels, keep reading.
As hours
is presumably a numeric variable, it is the same idea with a twist:
bysort omb (hours) : replace hours = hours[1]
In neither case is there any check for two or more non-missing values for the same identifier.
For a numeric variable, whether with or without value labels, a check would be
bysort omb (hours) : gen byte OK = (hours == hours[1]) | missing(hours)
You should then want to look if any observations are 0 on OK
. 1 means "OK".
And from the above string variables can be checked too, with a need to look in the last observation -- indexed by _N
-- rather than the first -- indexed by 1
.
Upvotes: 1
Reputation: 3255
If you do not care about maintaining original sort order, then you can do this:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte omb_control_number str7 agency byte hours
1 "HHS-ACF" .
1 "" 10
2 "" .
2 "" .
2 "HHS-CDC" 2
3 "" .
3 "HHS-ACF" 3
3 "" .
4 "HHS-ACF" 10
4 "" .
4 "" .
4 "" .
end
gsort omb_control_number -agency
bys omb_control_number : replace agency = agency[_n-1] if missing(agency)
sort omb_control_number hours
bys omb_control_number : replace hours = hours[_n-1] if missing(hours)
Upvotes: 1