Reputation: 311
I have a dataset like this:
tdf <- data.frame(
A = c(1, 2, 5, 4),
B = c(NA, 3, 4, 5),
C = c(NA, NA, NA, NA),
D = c(NA, NA, NA, NA),
E = c(10, 20, 30, 40))
And I need to replace the last non-NA value in each row, going from left to right, with the value from another column, say E, to give this:
enddf <- data.frame(
A = c(10, 2, 5, 4),
B = c(NA, 20, 30, 40),
C = c(NA, NA, NA, NA),
D = c(NA, NA, NA, NA),
E = c(10, 20, 30, 40))
I'm reading multiple data frames with varying numbers of columns, so I'm thinking a function that reads across all columns but the last one would work?
I tried something like this, but I feel like I'm going the wrong direction:
df <- tdf %>%
rowwise() %>%
mutate(
A = ifelse(is.na(A), A, ifelse(!is.na(B), C, A)),
B = ifelse(!is.na(A) & is.na(B), C, B)
)
Upvotes: 6
Views: 191
Reputation: 270045
Let ix be the column number before the first NA, one per row. If a row has no NAs then match returns 1 via its nomatch argument, which becomes 0 after subtracting 1. In the first test below ix is c(1, 2, 2, 2) and in the second it is c(1, 2, 2, 0, 0). Then insert tdf$E into successive rows such that the first element of ix corresponds to row 1, the second to row 2 and so on. An ix element of 0 means skip the insertion for that row, so that row remains unchanged.
(The code works whether data is a data frame or matrix but if it were known to always be a matrix then [ix>0, ] in the second line could optionally be omitted due to a subtle difference between matrix and data frame indexing.)
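As a rough illustration of the steps described above, here is a minimal sketch that builds ix and the (row, column) index matrix by hand for the question's tdf. The names idx and out exist only for this illustration, and drop = FALSE is an added precaution so the index stays a two-column matrix even if only one row qualifies.

```r
# Data from the question
tdf <- data.frame(
  A = c(1, 2, 5, 4),
  B = c(NA, 3, 4, 5),
  C = c(NA, NA, NA, NA),
  D = c(NA, NA, NA, NA),
  E = c(10, 20, 30, 40))

# Position of the first NA in each row, minus 1
# (nomatch = 1 turns "no NA in this row" into 0)
ix <- apply(tdf, 1, match, x = NA, nomatch = 1) - 1

# Two-column (row, column) index matrix, keeping only rows with ix > 0
idx <- cbind(seq_along(ix), ix)[ix > 0, , drop = FALSE]

# Matrix indexing on the left-hand side writes one cell per index row
out <- tdf
out[idx] <- tdf$E[ix > 0]
out
```

Here every row has ix > 0, so all four rows receive their value from E.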
This solution has a number of advantages:
Code
fillin <- function(data, col = ncol(data)) {
  ix <- apply(data, 1, match, x = NA, nomatch = 1) - 1
  replace(data, cbind(seq_along(ix), ix)[ix>0,], data[ix>0, col])
}
# tests
identical(fillin(tdf), enddf) # inputs from question
## [1] TRUE
identical(fillin(tdf2), enddf2) # inputs from Note below
## [1] TRUE
Input used
tdf2 <- data.frame(
A = c(1, 2, 5, NA,1),
B = c(NA, 3, 4, NA,1),
C = c(NA, NA, NA, NA,1),
D = c(NA, NA, NA, NA,1),
E = c(10, 20, 30, 40,1))
enddf2 <- data.frame(
A = c(10, 2, 5, NA, 1),
B = c(NA, 20, 30, NA, 1),
C = c(NA, NA, NA, NA, 1),
D = c(NA, NA, NA, NA, 1),
E = c(10, 20, 30, 40, 1))
Removed stray period in code and enhanced to support both the test example in the question and the new one in the comment under this question. Also made a number of other small changes.
Upvotes: -1
Reputation: 6529
You could solve your problem as follows:
ij = which(is.na(tdf), arr.ind=TRUE)
ij = ij[!duplicated(ij[,1]) & ij[,2]>1,]
ij[, 2] = ij[, 2]-1
tdf[ij] = tdf$E[ij[,1]]
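To see what each step does, it may help to look at the intermediate index matrix. A minimal sketch using tdf from the question; first_na and target are names made up for this illustration.

```r
tdf <- data.frame(
  A = c(1, 2, 5, 4),
  B = c(NA, 3, 4, 5),
  C = c(NA, NA, NA, NA),
  D = c(NA, NA, NA, NA),
  E = c(10, 20, 30, 40))

# All (row, col) positions holding NA, listed column by column
ij <- which(is.na(tdf), arr.ind = TRUE)

# Keep the first entry seen for each row: its leftmost NA
first_na <- ij[!duplicated(ij[, 1]), , drop = FALSE]

# Step one column to the left to reach the cell that gets overwritten
target <- first_na
target[, 2] <- target[, 2] - 1
target
```

which() returns positions in column-major order, so for any given row the first entry kept by !duplicated() is its leftmost NA.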
Upvotes: 0
Reputation: 76641
A run-length encoding, rle, of is.na(x) can give you the position of the last non-NA value before the first NA on each row: it is the length of the first run. Use that as the column number in an index matrix.
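As a small illustration of the idea, here is what rle returns for a single row. This is only a sketch: x mimics the first row of the question's data, and runs is a name chosen for the example.

```r
# Values of the first row of the question's data (columns A to E)
x <- c(1, NA, NA, NA, 10)

# Runs of consecutive FALSE/TRUE in the NA mask
runs <- rle(is.na(x))
runs$lengths     # one non-NA, then three NAs, then one non-NA
runs$values      # FALSE TRUE FALSE

# Length of the first run = column of the last non-NA before the first NA
runs$lengths[1]
```

Note that this reading of the first run assumes the row starts with a non-NA value; if the first column is NA, the first run is a run of NAs instead, which runs$values[1] would reveal.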
tdf <- data.frame(
A = c(1, 2, 5, 4),
B = c(NA, 3, 4, 5),
C = c(NA, NA, NA, NA),
D = c(NA, NA, NA, NA),
E = c(10, 20, 30, 40))
inx <- apply(tdf, 1L, \(x) {
  i <- rle(is.na(x))$lengths[1L]
  # i counts columns, so compare it with the number of columns
  ifelse(i < ncol(tdf), i, NA_integer_)
})
inx <- cbind(seq_len(nrow(tdf)), inx)
tdf[inx] <- tdf$E
tdf
#> A B C D E
#> 1 10 NA NA NA 10
#> 2 2 20 NA NA 20
#> 3 5 30 NA NA 30
#> 4 4 40 NA NA 40
Created on 2025-02-21 with reprex v2.1.1
Upvotes: 4
Reputation: 40171
One tidyverse option could be:
library(dplyr)
library(purrr)
tdf %>%
  mutate(pmap_dfr(across(c(A:D, E)),
                  ~ `[<-`(c(...), max(which(!is.na(head(c(...), -1)))), tail(c(...), 1))))
A B C D E
1 10 NA NA NA 10
2 2 20 NA NA 20
3 5 30 NA NA 30
4 4 40 NA NA 40
Upvotes: 2
Reputation: 102529
You can use max.col + is.na like below:
> tdf[cbind(1:nrow(tdf), max.col(is.na(tdf), "first") - 1)] <- tdf$E
> tdf
A B C D E
1 10 NA NA NA 10
2 2 20 NA NA 20
3 5 30 NA NA 30
4 4 40 NA NA 40
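The one-liner is written for data where every row has its first NA somewhere after column 1. A hedged sketch of a wrapper that leaves other rows untouched follows; fill_before_first_na is a hypothetical helper name, not part of the answer above, and it assumes the source column is always the last one.

```r
# Hypothetical helper, not from the answer: fill the cell just before the
# first NA of each row with the value from the last column.
fill_before_first_na <- function(df) {
  na_mat <- is.na(df)
  first_na <- max.col(na_mat, ties.method = "first")  # leftmost NA per row
  # Skip rows with no NA at all and rows whose first column is already NA
  ok <- rowSums(na_mat) > 0 & first_na > 1
  if (!any(ok)) return(df)
  df[cbind(which(ok), first_na[ok] - 1)] <- df[[ncol(df)]][ok]
  df
}

# Rows 4 and 5 exercise the two skipped cases
tdf2 <- data.frame(
  A = c(1, 2, 5, NA, 1),
  B = c(NA, 3, 4, NA, 1),
  C = c(NA, NA, NA, NA, 1),
  D = c(NA, NA, NA, NA, 1),
  E = c(10, 20, 30, 40, 1))
res <- fill_before_first_na(tdf2)
res
```

Because the helper takes the last column via ncol(df), it should carry over to data frames with a different number of columns, as long as the value column stays last.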
Upvotes: 4
Reputation: 19191
A base R approach that looks for NA row-wise with apply.
wcol <- ncol(tdf)
data.frame(t(apply(tdf, 1, \(x) {
  x[which(is.na(x))[1] - 1] <- x[wcol]
  x
})))
output
A B C D E
1 10 NA NA NA 10
2 2 20 NA NA 20
3 5 30 NA NA 30
4 4 40 NA NA 40
Since the values are all numeric no conversion occurs with the intermediate matrix result.
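To make that caveat concrete, here is a minimal sketch of what apply does to the row values when the columns are not all numeric. The data frames num_df and mix_df are made up for this illustration.

```r
# All-numeric data frame: the intermediate matrix stays numeric
num_df <- data.frame(A = c(1, NA), E = c(10, 20))
num_type <- typeof(apply(num_df, 1, identity))
num_type   # "double"

# One character column: apply() goes through as.matrix(), so every
# value, including the numbers in E, becomes a character string
mix_df <- data.frame(A = c("a", NA), E = c(10, 20))
mix_type <- typeof(apply(mix_df, 1, identity))
mix_type   # "character"
```

So with mixed column types the result of this approach would likely need its numeric columns converted back afterwards.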
Upvotes: 6