Reputation: 21

replace missing values using other rows only when other columns are the same in R

I guess that other people have already looked for it but couldn't find what I'm looking for.

I want to replace NA values with the value of the row above, only when all other values are the same. Bonus point for data.table solution.

Right now, I've managed to do it only with a (very inefficient) loop.

In addition, my current code does not replace NA in case that there are two NA's in the same row.

I have a strong feeling that I'm overthinking this problem. Any ideas of making this stuff easier?

ex <- data.table(
    id = c(1, 1, 2, 2),
    attr1 = c(NA, NA, 3, 3),
    attr2 = c(2, 2, NA, 3),
    attr3 = c(NA, 2, 2, 1),
    attr4 = c(1, 1, 1, 3)
)

desired_ex <- data.table(
    id = c(1, 1, 2, 2),
    attr1 = c(NA, NA, 3, 3),
    attr2 = c(2, 2, NA, 3),
    attr3 = c(2, 2, 2, 1),
    attr4 = c(1, 1, 1, 3)
)

col_names <- paste0("attr", 1:4)
r<-1
for (r in 1:nrow(ex)) {
    print(r)
    to_check <- col_names[colSums(is.na(ex[r, .SD, .SDcols = col_names])) >0]
    if (length(to_check) == 0) {
        print("no NA- next")
        next
    }
    
    for (col_check in to_check) {
        .ex <- copy(ex)[seq(from = r, to = r + 1), ]
        .ex[[col_check]] <- NULL
        if (nrow(unique(.ex)) == 1) {
            ex[[col_check]][r] <- ex[[col_check]][r + 1]
        }
    }
}

all.equal(ex, desired_ex)

Upvotes: 0

Answers (2)

Uwe

Reputation: 42582

Here is a solution which will work for an arbitrary number of rows and columns within each id not just pairs of rows:

library(data.table)
ex[,  
   if (all(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE))))) {
     lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N)) 
   } else {
     .SD
   }, by  = id]

or, more compact,

ex[, if (all(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE))))) 
  lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N)) else .SD, by  = id]

   id attr1 attr2 attr3 attr4
1:  1    NA     2     2     1
2:  1    NA     2     2     1
3:  2     3    NA     2     1
4:  2     3     3     1     3

Explanation

For each id it is checked if the rows fulfill the condition. If not .SD is returned unchanged. If the condition is fulfilled a new .SD is created by picking the first non-NA value in each column (or NA in case of all NA) using fcoalesce() and replicating this value as many times as there are rows in .SD.

The check for the condition consists of 2 parts. First, it is checked for each column in .SD if all values are identical thereby ignoring any NA. Finally, it is checked if this is TRUE for all columns.

Note that .SD is a data.table containing the Subset of Data for each group, excluding any columns used in by.

Another use case with more rows and columns

ex2 <- fread("
 id   foo   bar   baz attr4 attr5
  1    NA     2    NA     1     5
  1    NA     2     2     1    NA
  1    NA     2    NA    NA    NA
  2     3    NA     2     1     2
  2     3     3     1     3     2
  2     3     3     1     4     2
  3     5     2    NA     1     3
  3    NA     2     2     1     3
  4    NA    NA    NA    NA    NA
")

ex2[, if (sum(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE)))) == ncol(.SD)) 
  lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N)) else .SD, by  = id]

   id foo bar baz attr4 attr5
1:  1  NA   2   2     1     5
2:  1  NA   2   2     1     5
3:  1  NA   2   2     1     5
4:  2   3  NA   2     1     2
5:  2   3   3   1     3     2
6:  2   3   3   1     4     2
7:  3   5   2   2     1     3
8:  3   5   2   2     1     3
9:  4  NA  NA  NA    NA    NA

Upvotes: 1

chinsoon12

Reputation: 25223

Here is an option mixing base R with data.table:

#lead the values for comparison
cols <- paste0("attr", 1L:4L)
lcols <- paste0("lead_", cols)
ex[, (lcols) := shift(.SD, -1L), id]

#check which rows fulfill the criteria
flags <- apply(ex[, ..cols] == ex[, ..lcols], 1L, all, na.rm=TRUE) & 
    apply(ex[, ..lcols], 1L, function(x) !all(is.na(x)))

#update those rows with values from row below
ex[(flags), (cols) := 
    mapply(function(x, y) fcoalesce(x, y), mget(lcols), mget(cols), SIMPLIFY=FALSE)]
ex[, (lcols) := NULL][]

Solution assumes that there is no recursive populating where the row after next is used to fill the current row if criteria is met.

Upvotes: 0

replace missing values using other rows only when other columns are the same in R

Answers (2)

Explanation

Another use case with more rows and columns

Related Questions