Jay khan
Jay khan

Reputation: 745

How to fill missing or empty values based on previous and future values?

Suppose I have following dataset

ID A 
1  0
1  0
1  0
1  X
2  X
2  X
2  0
2  X
3  X
3  0
3  X
3  Y
4  X
4  X
4  0
4  Y

I want to fill this dataset (variable A) based on past and future values of A. For example for ID= 1 first three values will become X too based on future value. For ID=2, missing value will become X because past and future values are also X. For ID = 3 it will also become X even in future there is another value Y but its adjacent values are X so it will become X. For ID= 4 we will base missing values only on past.

Basically, the algorithm should look at the past values first; if no past value is available then it should use the future values.

I know about filling the missing values based on mean and median but as variable A is not numeric and the criterion keeps on changing at each step, how should I tackle this problem?

Upvotes: 1

Views: 2366

Answers (1)

akrun
akrun

Reputation: 887223

We can use na.locf from zoo to fill the missing values with the previous values. To do that, we convert the '0' values to 'NA'. Using data.table, we can convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'ID' column, we fill the missing values with previous non-missing value using na.locf and the na.rm=FALSE option. Then, we can again run na.locf with fromLast=TRUE option to cause observations to be carried backward and fill the remaining NAs in 'A' column.

library(zoo)
library(data.table)
is.na(df1$A) <- df1$A==0
setDT(df1)[,A := na.locf(na.locf(A, na.rm=FALSE), fromLast=TRUE) , by = ID]
df1
#    ID A
# 1:  1 X
# 2:  1 X
# 3:  1 X
# 4:  1 X
# 5:  2 X
# 6:  2 X
# 7:  2 X
# 8:  2 X
# 9:  3 X
#10:  3 X
#11:  3 X
#12:  3 Y
#13:  4 X
#14:  4 X
#15:  4 X
#16:  4 Y

data

df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 4L, 4L, 4L, 4L), A = c("0", "0", "0", "X", "X", "X", 
"0", "X", "X", "0", "X", "Y", "X", "X", "0", "Y")), .Names = c("ID", 
"A"), class = "data.frame", row.names = c(NA, -16L))

Upvotes: 2

Related Questions