Reputation: 125
I want to replace NA
value with mean of adjacent non-missing values in "return" column, grouped by "id". Let assume that there are only two months: 1,2 in a year.
df <- data.frame(id = c("A","A","A","A","B","B","B","B"),
year = c(2014,2014,2015,2015),
month = c(1, 2),
marketcap = c(4,6,2,6,23,2,5,34),
return = c(NA,0.23,0.2,0.1,0.4,0.9,NA,0.6))
df1
id year month marketcap return
1: A 2014 1 4 NA # <-
2: A 2014 2 6 0.23
3: A 2015 1 2 0.20
4: A 2015 2 6 0.10
5: B 2014 1 23 0.40
6: B 2014 2 2 0.90
7: B 2015 1 5 NA # <-
8: B 2015 2 34 0.60
Desired data
desired_df <- data.frame(id = c("A","A","A","A","B","B","B","B"),
year = c(2014,2014,2015,2015),
month = c(1,2),
marketcap = c(4,6,2,6,23,2,5,34),
return = c(0.23,0.23,0.2,0.1,0.4,0.9,0.75,0.6))
desired_df
id year month marketcap return
1 A 2014 1 4 0.23 # <-
2 A 2014 2 6 0.23
3 A 2015 1 2 0.20
4 A 2015 2 6 0.10
5 B 2014 1 23 0.40
6 B 2014 2 2 0.90
7 B 2015 1 5 0.75 # <-
8 B 2015 2 34 0.60
The second NA
(row 7) should be replaced by the mean of the values before and after, i.e. (0.9 + 0.6)/2 = 0.75.
Note that the first NA
(row 1), has no previous data. Here NA
should be replaced with the next non-missing value, 0.23 ("last observation carried backwards").
A data.table
solution is preferred if it is possible
UPDATE: When use the code structure as follows (which works for the sample)
df[,returnInterpolate:=na.approx(return,rule=2), by=id]
I have encountered the error: Error in approx(x[!na], y[!na], xout, ...) : need at least two non-NA values to interpolate
I guess that may be there is some id that have no non-NA values to interpolate. .any suggestions?
Upvotes: 3
Views: 1515
Reputation: 7730
The easy imputeTS
solution without caring for the ID would be:
library("imputeTS")
na.interpolate(df)
Since the imputation should be according to ID, it is a little bit more complicated - since it seems often there are not enough values left when filtered by ID. I would take the solution Roland posted and use imputeTS::na.interpolation()
where possible and in the other cases maybe the overall mean with imputeTS::na.mean()
or a random guess in the overall bounds imputeTS::na.random()
could be used.
In this case it might also be a very good idea to look beyond univariate time series interpolation / imputation. There are a lot of other variables that could help estimating the missing values (if there is a correlation). Packages like AMELIA
could help here.
Upvotes: 0
Reputation: 132706
library(data.table)
df <- data.frame(id=c("A","A","A","A","B","B","B","B"),
year=c(2014,2014,2015,2015),
month=c(1,2),
marketcap=c(4,6,2,6,23,2,5,34),
return=c(NA,0.23,0.2,0.1,0.4,0.9,NA,0.6))
setDT(df)
library(zoo)
df[, returnInterpol := na.approx(return, rule = 2), by = id]
# id year month marketcap return returnInterpol
#1: A 2014 1 4 NA 0.23
#2: A 2014 2 6 0.23 0.23
#3: A 2015 1 2 0.20 0.20
#4: A 2015 2 6 0.10 0.10
#5: B 2014 1 23 0.40 0.40
#6: B 2014 2 2 0.90 0.90
#7: B 2015 1 5 NA 0.75
#8: B 2015 2 34 0.60 0.60
Edit:
If you have groups with only NA
values or only one non-NA
, you could do this:
df <- data.frame(id=c("A","A","A","A","B","B","B","B","C","C","C","C"),
year=c(2014,2014,2015,2015),
month=c(1,2),
marketcap=c(4,6,2,6,23,2,5,34, 1:4),
return=c(NA,0.23,0.2,0.1,0.4,0.9,NA,0.6,NA,NA,0.3,NA))
setDT(df)
df[, returnInterpol := switch(as.character(sum(!is.na(return))),
"0" = return,
"1" = {na.omit(return)},
na.approx(return, rule = 2)), by = id]
# id year month marketcap return returnInterpol
# 1: A 2014 1 4 NA 0.23
# 2: A 2014 2 6 0.23 0.23
# 3: A 2015 1 2 0.20 0.20
# 4: A 2015 2 6 0.10 0.10
# 5: B 2014 1 23 0.40 0.40
# 6: B 2014 2 2 0.90 0.90
# 7: B 2015 1 5 NA 0.75
# 8: B 2015 2 34 0.60 0.60
# 9: C 2014 1 1 NA 0.30
# 10: C 2014 2 2 NA 0.30
# 11: C 2015 1 3 0.30 0.30
# 12: C 2015 2 4 NA 0.30
Upvotes: 6