Reputation: 745
Suppose I have following dataset
ID A
1 0
1 0
1 0
1 X
2 X
2 X
2 0
2 X
3 X
3 0
3 X
3 Y
4 X
4 X
4 0
4 Y
I want to fill this dataset (variable A) based on past and future values of A.
For example for ID= 1
first three values will become X too based on future value. For ID=2
, missing value will become X because past and future values are also X. For ID = 3
it will also become X even in future there is another value Y but its adjacent values are X so it will become X. For ID= 4
we will base missing values only on past.
Basically, the algorithm should look at the past values first; if no past value is available then it should use the future values.
I know about filling the missing values based on mean and median but as variable A is not numeric and the criterion keeps on changing at each step, how should I tackle this problem?
Upvotes: 1
Views: 2366
Reputation: 887223
We can use na.locf
from zoo
to fill the missing values with the previous values. To do that, we convert the '0' values to 'NA'. Using data.table
, we can convert the 'data.frame' to 'data.table' (setDT(df1)
), grouped by 'ID' column, we fill the missing values with previous non-missing value using na.locf
and the na.rm=FALSE
option. Then, we can again run na.locf
with fromLast=TRUE
option to cause observations to be carried backward and fill the remaining NAs
in 'A' column.
library(zoo)
library(data.table)
is.na(df1$A) <- df1$A==0
setDT(df1)[,A := na.locf(na.locf(A, na.rm=FALSE), fromLast=TRUE) , by = ID]
df1
# ID A
# 1: 1 X
# 2: 1 X
# 3: 1 X
# 4: 1 X
# 5: 2 X
# 6: 2 X
# 7: 2 X
# 8: 2 X
# 9: 3 X
#10: 3 X
#11: 3 X
#12: 3 Y
#13: 4 X
#14: 4 X
#15: 4 X
#16: 4 Y
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L), A = c("0", "0", "0", "X", "X", "X",
"0", "X", "X", "0", "X", "Y", "X", "X", "0", "Y")), .Names = c("ID",
"A"), class = "data.frame", row.names = c(NA, -16L))
Upvotes: 2