Reputation: 21
I guess that other people have already looked for it but couldn't find what I'm looking for.
I want to replace NA values with the value of the row above, only when all other values are the same. Bonus point for data.table
solution.
Right now, I've managed to do it only with a (very inefficient) loop.
In addition, my current code does not replace NA in case that there are two NA's in the same row.
I have a strong feeling that I'm overthinking this problem. Any ideas of making this stuff easier?
ex <- data.table(
id = c(1, 1, 2, 2),
attr1 = c(NA, NA, 3, 3),
attr2 = c(2, 2, NA, 3),
attr3 = c(NA, 2, 2, 1),
attr4 = c(1, 1, 1, 3)
)
desired_ex <- data.table(
id = c(1, 1, 2, 2),
attr1 = c(NA, NA, 3, 3),
attr2 = c(2, 2, NA, 3),
attr3 = c(2, 2, 2, 1),
attr4 = c(1, 1, 1, 3)
)
col_names <- paste0("attr", 1:4)
r<-1
for (r in 1:nrow(ex)) {
print(r)
to_check <- col_names[colSums(is.na(ex[r, .SD, .SDcols = col_names])) >0]
if (length(to_check) == 0) {
print("no NA- next")
next
}
for (col_check in to_check) {
.ex <- copy(ex)[seq(from = r, to = r + 1), ]
.ex[[col_check]] <- NULL
if (nrow(unique(.ex)) == 1) {
ex[[col_check]][r] <- ex[[col_check]][r + 1]
}
}
}
all.equal(ex, desired_ex)
Upvotes: 0
Views: 234
Reputation: 42582
Here is a solution which will work for an arbitrary number of rows and columns within each id
not just pairs of rows:
library(data.table)
ex[,
if (all(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE))))) {
lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N))
} else {
.SD
}, by = id]
or, more compact,
ex[, if (all(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE)))))
lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N)) else .SD, by = id]
id attr1 attr2 attr3 attr4 1: 1 NA 2 2 1 2: 1 NA 2 2 1 3: 2 3 NA 2 1 4: 2 3 3 1 3
For each id
it is checked if the rows fulfill the condition. If not .SD
is returned unchanged. If the condition is fulfilled a new .SD
is created by picking the first non-NA value in each column (or NA
in case of all NA
) using fcoalesce()
and replicating this value as many times as there are rows in .SD
.
The check for the condition consists of 2 parts. First, it is checked for each column in .SD
if all values are identical thereby ignoring any NA
. Finally, it is checked if this is TRUE
for all columns.
Note that .SD
is a data.table
containing the Subset of Data for each group, excluding any columns used in by
.
ex2 <- fread("
id foo bar baz attr4 attr5
1 NA 2 NA 1 5
1 NA 2 2 1 NA
1 NA 2 NA NA NA
2 3 NA 2 1 2
2 3 3 1 3 2
2 3 3 1 4 2
3 5 2 NA 1 3
3 NA 2 2 1 3
4 NA NA NA NA NA
")
ex2[, if (sum(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE)))) == ncol(.SD))
lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N)) else .SD, by = id]
id foo bar baz attr4 attr5 1: 1 NA 2 2 1 5 2: 1 NA 2 2 1 5 3: 1 NA 2 2 1 5 4: 2 3 NA 2 1 2 5: 2 3 3 1 3 2 6: 2 3 3 1 4 2 7: 3 5 2 2 1 3 8: 3 5 2 2 1 3 9: 4 NA NA NA NA NA
Upvotes: 1
Reputation: 25223
Here is an option mixing base R with data.table:
#lead the values for comparison
cols <- paste0("attr", 1L:4L)
lcols <- paste0("lead_", cols)
ex[, (lcols) := shift(.SD, -1L), id]
#check which rows fulfill the criteria
flags <- apply(ex[, ..cols] == ex[, ..lcols], 1L, all, na.rm=TRUE) &
apply(ex[, ..lcols], 1L, function(x) !all(is.na(x)))
#update those rows with values from row below
ex[(flags), (cols) :=
mapply(function(x, y) fcoalesce(x, y), mget(lcols), mget(cols), SIMPLIFY=FALSE)]
ex[, (lcols) := NULL][]
Solution assumes that there is no recursive populating where the row after next is used to fill the current row if criteria is met.
Upvotes: 0