Pham Cong Minh
Pham Cong Minh

Reputation: 125

Replace NA with mean of adjacent values

I want to replace NA value with mean of adjacent non-missing values in "return" column, grouped by "id". Let assume that there are only two months: 1,2 in a year.

df <- data.frame(id = c("A","A","A","A","B","B","B","B"),
                 year = c(2014,2014,2015,2015),
                 month = c(1, 2),
                 marketcap = c(4,6,2,6,23,2,5,34),
                 return = c(NA,0.23,0.2,0.1,0.4,0.9,NA,0.6))

df1
   id year month marketcap return
1:  A 2014     1         4     NA # <-
2:  A 2014     2         6   0.23
3:  A 2015     1         2   0.20
4:  A 2015     2         6   0.10
5:  B 2014     1        23   0.40
6:  B 2014     2         2   0.90
7:  B 2015     1         5     NA # <-
8:  B 2015     2        34   0.60

Desired data

desired_df <- data.frame(id = c("A","A","A","A","B","B","B","B"),
                         year = c(2014,2014,2015,2015),
                         month = c(1,2),
                         marketcap = c(4,6,2,6,23,2,5,34),
                         return = c(0.23,0.23,0.2,0.1,0.4,0.9,0.75,0.6))

desired_df
  id year month marketcap return
1  A 2014     1         4   0.23 # <-
2  A 2014     2         6   0.23
3  A 2015     1         2   0.20
4  A 2015     2         6   0.10
5  B 2014     1        23   0.40
6  B 2014     2         2   0.90
7  B 2015     1         5   0.75 # <-
8  B 2015     2        34   0.60

The second NA (row 7) should be replaced by the mean of the values before and after, i.e. (0.9 + 0.6)/2 = 0.75.

Note that the first NA (row 1), has no previous data. Here NA should be replaced with the next non-missing value, 0.23 ("last observation carried backwards").

A data.table solution is preferred if it is possible

UPDATE: When use the code structure as follows (which works for the sample)

df[,returnInterpolate:=na.approx(return,rule=2), by=id]

I have encountered the error: Error in approx(x[!na], y[!na], xout, ...) : need at least two non-NA values to interpolate

I guess that may be there is some id that have no non-NA values to interpolate. .any suggestions?

Upvotes: 3

Views: 1515

Answers (2)

Steffen Moritz
Steffen Moritz

Reputation: 7730

The easy imputeTS solution without caring for the ID would be:

library("imputeTS")
na.interpolate(df)

Since the imputation should be according to ID, it is a little bit more complicated - since it seems often there are not enough values left when filtered by ID. I would take the solution Roland posted and use imputeTS::na.interpolation() where possible and in the other cases maybe the overall mean with imputeTS::na.mean() or a random guess in the overall bounds imputeTS::na.random() could be used.

In this case it might also be a very good idea to look beyond univariate time series interpolation / imputation. There are a lot of other variables that could help estimating the missing values (if there is a correlation). Packages like AMELIA could help here.

Upvotes: 0

Roland
Roland

Reputation: 132706

library(data.table)
df <- data.frame(id=c("A","A","A","A","B","B","B","B"),
                 year=c(2014,2014,2015,2015),
                 month=c(1,2),
                 marketcap=c(4,6,2,6,23,2,5,34),
                 return=c(NA,0.23,0.2,0.1,0.4,0.9,NA,0.6))
setDT(df)
library(zoo)
df[, returnInterpol := na.approx(return, rule = 2), by = id]
#   id year month marketcap return returnInterpol
#1:  A 2014     1         4     NA           0.23
#2:  A 2014     2         6   0.23           0.23
#3:  A 2015     1         2   0.20           0.20
#4:  A 2015     2         6   0.10           0.10
#5:  B 2014     1        23   0.40           0.40
#6:  B 2014     2         2   0.90           0.90
#7:  B 2015     1         5     NA           0.75
#8:  B 2015     2        34   0.60           0.60

Edit:

If you have groups with only NA values or only one non-NA, you could do this:

df <- data.frame(id=c("A","A","A","A","B","B","B","B","C","C","C","C"),
                 year=c(2014,2014,2015,2015),
                 month=c(1,2),
                 marketcap=c(4,6,2,6,23,2,5,34, 1:4),
                 return=c(NA,0.23,0.2,0.1,0.4,0.9,NA,0.6,NA,NA,0.3,NA))
setDT(df)
df[, returnInterpol := switch(as.character(sum(!is.na(return))),
                              "0" = return,
                              "1" = {na.omit(return)},  
                              na.approx(return, rule = 2)), by = id]

#     id year month marketcap return returnInterpol
#  1:  A 2014     1         4     NA           0.23
#  2:  A 2014     2         6   0.23           0.23
#  3:  A 2015     1         2   0.20           0.20
#  4:  A 2015     2         6   0.10           0.10
#  5:  B 2014     1        23   0.40           0.40
#  6:  B 2014     2         2   0.90           0.90
#  7:  B 2015     1         5     NA           0.75
#  8:  B 2015     2        34   0.60           0.60
#  9:  C 2014     1         1     NA           0.30
# 10:  C 2014     2         2     NA           0.30
# 11:  C 2015     1         3   0.30           0.30
# 12:  C 2015     2         4     NA           0.30

Upvotes: 6

Related Questions