username
username

Reputation: 65

Filling missing only if "sandwiched" by the same value

I have seen questions for this in R and Python (e.g., NA filling only if "sandwiched" by the same value using dplyr), but can't find an answer of how to do this in Stata.

I would like to fill in missing values in a panel dataset ONLY if it is in between 2 equal non-missing values. For example, if I have

clear 
input score group
2005 1
2006 .
2007 .
2008 .
2009 1
2010 8
2011 1
2012 .
2013 5
2014 3
2015 4
end

the missing observations in 2006, 2007, and 2008 would be filled in to become 1, but the missing observation in 2012 would not be filled in. The missing values are categorical and not ordinal.

I have tried filling in with the prior value if the prior value and next value are equal, but this doesn't account for situations where there are more than 1 missing sandwiched between two equal non-missing values.

Edit: The solution provided below works in some cases but acts unexpectedly in others (I've changed some of the variable names to make it more intuitive and to include two examples in one block). Suppose I start with

clear 
input year group id
2005 1 1
2006 1 .
2007 1 .
2008 1 .
2009 1 .
2010 1 .
2011 1 1
2007 2 3
2008 2 .
2009 2 .
2010 2 2
2011 2 2
end

Using the solution, sometimes I get

clear 
input year group id
2005 1 1
2006 1 .
2007 1 .
2008 1 .
2009 1 .
2010 1 .
2011 1 1
2007 2 3
2008 2 3
2009 2 3
2010 2 2
2011 2 2
end

For group 1, the non-missings do not fill in when they should. For group 2, they do fill in but should not because the non-missing values at the ends of the sandwich are not equal.

Upvotes: 0

Views: 211

Answers (1)

Nick Cox
Nick Cox

Reputation: 37208

Here is one way to do it.

  1. One method of interpolation or imputation is just to carry the last known value forward. In Stata this is discussed at https://www.stata.com/support/faqs/data-management/replacing-missing-values/

  2. Another method is linear interpolation. Linear interpolation will produce the same answer as method 1 if, and only if, the interpolation is flat, so that interpolated values equal known values at each end.

No doubt you could just interpolate and check for constancy, but my guess is that would be about as much code as is given here.


clear 
input score group
2005 1
2006 .
2007 .
2008 .
2009 1
2010 8
2011 1
2012 .
2013 5
2014 3
2015 4
end

clonevar guess1 = group 
replace guess1 = guess1[_n-1] if missing(guess1) 

ipolate group score, gen(guess2)

gen goodguess = guess1 if guess1 == guess2 

list, sepby(goodguess)

     +--------------------------------------------+
     | score   group   guess1   guess2   goodgu~s |
     |--------------------------------------------|
  1. |  2005       1        1        1          1 |
  2. |  2006       .        1        1          1 |
  3. |  2007       .        1        1          1 |
  4. |  2008       .        1        1          1 |
  5. |  2009       1        1        1          1 |
     |--------------------------------------------|
  6. |  2010       8        8        8          8 |
     |--------------------------------------------|
  7. |  2011       1        1        1          1 |
     |--------------------------------------------|
  8. |  2012       .        1        3          . |
     |--------------------------------------------|
  9. |  2013       5        5        5          5 |
     |--------------------------------------------|
 10. |  2014       3        3        3          3 |
     |--------------------------------------------|
 11. |  2015       4        4        4          4 |
     +--------------------------------------------+

You allude to panel data without exemplifying it, but this method can be extended easily to panel data. Use a by: or a by() option as needed.

EDIT The late detail that the variable concerned is categorical not ordinal (meaning, nominal not ordinal; many if not most authorities regard "ordinal" as one kind of categorical; witness Alan Agresti's texts) does not affect the solution here. The copying downwards method will never suggest a different category. ipolate treats data as if linear interpolation makes sense; when it suggests an unacceptable replacement value, that will not be accepted by the code above.

EDIT 2 You are getting results that you don't want because you are applying code for one problem to quite a different problem. Evidently three variables, not two, are involved, contrary to the initial question. In that case, the code should be different.

clear 
input year group id
2005 1 1
2006 1 .
2007 1 .
2008 1 .
2009 1 .
2010 1 .
2011 1 1
2007 2 3
2008 2 .
2009 2 .
2010 2 2
2011 2 2
end

clonevar guess1 = id 
bysort group (year) : replace guess1 = guess1[_n-1] if missing(guess1) 

ipolate id year, gen(guess2) by(group)

gen goodguess = guess1 if guess1 == guess2 

Otherwise put, the initial answer worked from the example given in which the observations were in year order and there was no grouping variable; hence there was only one observation for each distinct year.

When there is other structure (e.g. panel or longitudinal data) you need other code to match it.

Upvotes: 1

Related Questions