Reputation: 65
I have seen questions for this in R and Python (e.g., NA filling only if "sandwiched" by the same value using dplyr), but can't find an answer of how to do this in Stata.
I would like to fill in missing values in a panel dataset ONLY if it is in between 2 equal non-missing values. For example, if I have
clear
input score group
2005 1
2006 .
2007 .
2008 .
2009 1
2010 8
2011 1
2012 .
2013 5
2014 3
2015 4
end
the missing observations in 2006, 2007, and 2008 would be filled in to become 1, but the missing observation in 2012 would not be filled in. The missing values are categorical and not ordinal.
I have tried filling in with the prior value if the prior value and next value are equal, but this doesn't account for situations where there are more than 1 missing sandwiched between two equal non-missing values.
Edit: The solution provided below works in some cases but acts unexpectedly in others (I've changed some of the variable names to make it more intuitive and to include two examples in one block). Suppose I start with
clear
input year group id
2005 1 1
2006 1 .
2007 1 .
2008 1 .
2009 1 .
2010 1 .
2011 1 1
2007 2 3
2008 2 .
2009 2 .
2010 2 2
2011 2 2
end
Using the solution, sometimes I get
clear
input year group id
2005 1 1
2006 1 .
2007 1 .
2008 1 .
2009 1 .
2010 1 .
2011 1 1
2007 2 3
2008 2 3
2009 2 3
2010 2 2
2011 2 2
end
For group 1, the non-missings do not fill in when they should. For group 2, they do fill in but should not because the non-missing values at the ends of the sandwich are not equal.
Upvotes: 0
Views: 211
Reputation: 37208
Here is one way to do it.
One method of interpolation or imputation is just to carry the last known value forward. In Stata this is discussed at https://www.stata.com/support/faqs/data-management/replacing-missing-values/
Another method is linear interpolation. Linear interpolation will produce the same answer as method 1 if, and only if, the interpolation is flat, so that interpolated values equal known values at each end.
No doubt you could just interpolate and check for constancy, but my guess is that would be about as much code as is given here.
clear
input score group
2005 1
2006 .
2007 .
2008 .
2009 1
2010 8
2011 1
2012 .
2013 5
2014 3
2015 4
end
clonevar guess1 = group
replace guess1 = guess1[_n-1] if missing(guess1)
ipolate group score, gen(guess2)
gen goodguess = guess1 if guess1 == guess2
list, sepby(goodguess)
+--------------------------------------------+
| score group guess1 guess2 goodgu~s |
|--------------------------------------------|
1. | 2005 1 1 1 1 |
2. | 2006 . 1 1 1 |
3. | 2007 . 1 1 1 |
4. | 2008 . 1 1 1 |
5. | 2009 1 1 1 1 |
|--------------------------------------------|
6. | 2010 8 8 8 8 |
|--------------------------------------------|
7. | 2011 1 1 1 1 |
|--------------------------------------------|
8. | 2012 . 1 3 . |
|--------------------------------------------|
9. | 2013 5 5 5 5 |
|--------------------------------------------|
10. | 2014 3 3 3 3 |
|--------------------------------------------|
11. | 2015 4 4 4 4 |
+--------------------------------------------+
You allude to panel data without exemplifying it, but this method can be extended easily to panel data. Use a by:
or a by()
option as needed.
EDIT The late detail that the variable concerned is categorical not ordinal (meaning, nominal not ordinal; many if not most authorities regard "ordinal" as one kind of categorical; witness Alan Agresti's texts) does not affect the solution here. The copying downwards method will never suggest a different category. ipolate
treats data as if linear interpolation makes sense; when it suggests an unacceptable replacement value, that will not be accepted by the code above.
EDIT 2 You are getting results that you don't want because you are applying code for one problem to quite a different problem. Evidently three variables, not two, are involved, contrary to the initial question. In that case, the code should be different.
clear
input year group id
2005 1 1
2006 1 .
2007 1 .
2008 1 .
2009 1 .
2010 1 .
2011 1 1
2007 2 3
2008 2 .
2009 2 .
2010 2 2
2011 2 2
end
clonevar guess1 = id
bysort group (year) : replace guess1 = guess1[_n-1] if missing(guess1)
ipolate id year, gen(guess2) by(group)
gen goodguess = guess1 if guess1 == guess2
Otherwise put, the initial answer worked from the example given in which the observations were in year
order and there was no grouping variable; hence there was only one observation for each distinct year
.
When there is other structure (e.g. panel or longitudinal data) you need other code to match it.
Upvotes: 1