Carl H

Reputation: 1036

replace missing value based on linear prediction of nearby cells

I have a dataset (tsset) that has observations in some years but not others:

year x
1990 600
1991 .
1992 .
1993 .
1994 .
1995 1100
1996 .
1997 .
1998 1700

Suppose I am willing to assume that every missing observation between two non-missing years (say, 1990 and 1995) can be imputed by linear interpolation between those two years, which would make the data look like this:

year  x
1990  600
1991 [700]
1992 [800]
1993 [900]
1994 [1000]
1995  1100
1996 [1300]
1997 [1500]
1998  1700

Is there any way to do this efficiently? I am currently using something like cond(year>1990 & year <1995, [Value if True], [Value if False]), but I do not know a good way for every year from 1991 to 1994 to find 1990 as its lower bound and 1995 as its upper bound.

Stata's documentation demonstrates using x[_n-1] to fill a missing value from the previous cell, but I am not sure how this can be extended to solve the problem described above.

Upvotes: 1

Views: 1571

Answers (1)

Nick Cox

Reputation: 37358

What you ask for is linear interpolation. The command for it, ipolate, has been in Stata for most of its history. No loops are entailed.

clear 
input year x
1990 600
1991 .
1992 .
1993 .
1994 .
1995 1100
1996 .
1997 .
1998 1700
end 
ipolate x year, gen(xint) 
list , sep(0)

     +--------------------+
     | year      x   xint |
     |--------------------|
  1. | 1990    600    600 |
  2. | 1991      .    700 |
  3. | 1992      .    800 |
  4. | 1993      .    900 |
  5. | 1994      .   1000 |
  6. | 1995   1100   1100 |
  7. | 1996      .   1300 |
  8. | 1997      .   1500 |
  9. | 1998   1700   1700 |
     +--------------------+

Note that the original variable remains intact, which is prudent as a matter of an analysis audit trail.
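
As a check on the listing, the value for 1996 can be worked out by hand from the bracketing points (1995, 1100) and (1998, 1700) with the usual straight-line formula. This is only an illustration of the arithmetic, not a claim about how ipolate is coded internally:

```stata
* x0 + (x1 - x0) * (t - t0) / (t1 - t0), using values from the listing above
display 1100 + (1700 - 1100) * (1996 - 1995) / (1998 - 1995)
```

This displays 1300, which matches xint for 1996.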

ipolate extends to interpolation done separately within distinct groups. Most commonly in practice that means panel or longitudinal data, in which different panels (people, firms, countries, stations, sites, whatever) with distinct identifiers are followed over time.
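
A minimal sketch of the grouped case, assuming a panel identifier named id. The variable name and the toy values are made up for illustration and are not from the question:

```stata
clear
* hypothetical two-panel dataset; id and the values are illustrative only
input id year x
1 1990 600
1 1991 .
1 1992 800
2 1990 10
2 1991 .
2 1992 .
2 1993 40
end
* interpolate within each panel, never across the boundary between panels
bysort id (year): ipolate x year, gen(xint)
list, sepby(id)
```

With these values, panel 1 should get 700 for 1991 and panel 2 should get 20 and 30 for 1991 and 1992.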

There are naturally many other kinds of interpolation.

mipolate (SSC) is a user-written program that generalizes ipolate. See here for a discussion or just install it with ssc install mipolate and read its help.
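
A sketch of what using it might look like on the example data above. The option names linear and pchip are given as I understand them from the mipolate help, so treat them as assumptions and check help mipolate after installing:

```stata
* one-time install from SSC (needs internet access)
ssc install mipolate
* assumed option names; confirm with: help mipolate
mipolate x year, gen(xlin) linear
mipolate x year, gen(xpchip) pchip
list, sep(0)
```

The linear result should agree with what ipolate produces; the other methods are for cases where a straight line between neighbours is not a good assumption.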

Upvotes: 2
