Reputation: 2763
My objective here was to iterate over each column of a data frame and, within each column, over each row, applying a function to every cell. The specific function in this case replaces NA values with the corresponding value in the final column, but the details of the function are not relevant to the question. I got the results I needed using two nested for loops like this:
for (j in 1:ncol(df.i)) {
  for (i in 1:nrow(df.i)) {
    df.i[i,j] <- ifelse(is.na(df.i[i,j]), df.i[i,39], df.i[i,j])
  }
}
However, I believe this should be possible using an apply(df.i, 1, function) nested within an apply(df.i, 2, function), but I'm not sure whether that is possible or how to do it. Does anyone know how to achieve the same thing with nested calls to the apply function?
Upvotes: 0
Views: 256
Reputation: 76651
Here are four ways to do this, counting your own nested loops.
First, an example dataset.
set.seed(5345) # Make the results reproducible
df.i <- matrix(1:400, ncol = 40)
is.na(df.i) <- sample(400, 50)
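The replacement form `is.na(x) <- index` used above may be unfamiliar. As a small sketch on a toy matrix (the name `m` is only for illustration), it sets the elements at the given linear, column-major positions to NA:

```r
# Toy illustration of `is.na(x) <- index`; `m` is a hypothetical example matrix.
m <- matrix(1:12, ncol = 4)

# Positions are counted down the columns, so 2 is m[2, 1] and 7 is m[1, 3].
is.na(m) <- c(2, 7)

m                  # the matrix with two NA cells
which(is.na(m))    # linear positions of the NA cells
sum(is.na(m))      # how many cells are NA
```

In the dataset above, `sample(400, 50)` draws 50 distinct positions out of 400, so `df.i` starts with 50 NA cells.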
Now, following the comment by @Dave2e: keep just one for loop and vectorize the innermost one.
df.i2 <- df.i3 <- df.i1 <- df.i # Work with copies
for (j in 1:ncol(df.i1)) {
  df.i1[,j] <- ifelse(is.na(df.i1[, j]), df.i1[, 39], df.i1[, j])
}
Then a vectorized version, with no loops at all.
df.i2 <- ifelse(is.na(df.i), df.i[, 39], df.i)
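Why this works may not be obvious: the replacement vector has one value per row, and R recycles it down each column of the matrix, so every cell is paired with the value from its own row. A minimal sketch on a toy 3 x 3 matrix (the names `m`, `filled` and `expected` are illustrative, and column 3 stands in for column 39):

```r
# Toy matrix, filled column by column; column 3 is the fallback column.
m <- matrix(c(NA, 2, 3,
              4, NA, 6,
              7, 8, 9), nrow = 3)

# m[, 3] has length 3 and is recycled over the 9 cells in column-major order,
# so cell [i, j] is always paired with m[i, 3].
filled <- ifelse(is.na(m), m[, 3], m)

# m[1, 1] should become m[1, 3] = 7 and m[2, 2] should become m[2, 3] = 8.
expected <- matrix(c(7, 2, 3,
                     4, 8, 6,
                     7, 8, 9), nrow = 3)
all(filled == expected)
```

The recycling only lines up because the fallback vector has exactly one element per row; a vector of any other length would pair cells with the wrong values.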
Another vectorized solution, suggested by @Gregor in the comments, is much better, since ifelse is known to be relatively slow.
df.i3[is.na(df.i3)] <- df.i3[row(df.i3)[is.na(df.i3)], 39]
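To unpack that one-liner: `row()` returns a matrix of the same shape holding each cell's row number, so subsetting it with the NA mask gives the row of every missing cell. A small sketch on a toy matrix (names are illustrative, column 3 plays the role of column 39):

```r
# Toy matrix, filled column by column; column 3 is the fallback column.
m <- matrix(c(NA, 2, 3,
              4, NA, 6,
              7, 8, 9), nrow = 3)

na_pos  <- is.na(m)          # logical mask of the missing cells
na_rows <- row(m)[na_pos]    # row number of each missing cell

# Look up the fallback value for exactly those rows and assign it
# into the missing cells, in the same (column-major) order.
m[na_pos] <- m[na_rows, 3]
m
```

Only the missing cells are touched, which is why this avoids the overhead of evaluating both branches for every element as `ifelse` does.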
And your solution, as posted in the question.
for (j in 1:ncol(df.i)) {
  for (i in 1:nrow(df.i)) {
    df.i[i,j] <- ifelse(is.na(df.i[i,j]), df.i[i,39], df.i[i,j])
  }
}
Compare the results.
identical(df.i, df.i1)
#[1] TRUE
identical(df.i, df.i2)
#[1] TRUE
identical(df.i, df.i3)
#[1] TRUE
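Since the question asked specifically about apply, here is a hedged sketch of what that could look like. It is not one of the benchmarked solutions; it assumes a matrix input, and the names `m`, `fallback`, `by_apply` and `by_index` are made up for the example. Because `ifelse` is already vectorized over a whole column, a single `apply` over columns is enough and no nested call is needed:

```r
# Rebuild a fresh copy of the example data so the sketch is self-contained.
set.seed(5345)
m <- matrix(1:400, ncol = 40)
is.na(m) <- sample(400, 50)

# One fallback value per row, taken from column 39 as in the answer.
fallback <- m[, 39]

# apply() over columns (MARGIN = 2); each column is filled in one vectorized step.
by_apply <- apply(m, 2, function(col) ifelse(is.na(col), fallback, col))

# The fully vectorized version, for comparison.
by_index <- m
by_index[is.na(by_index)] <- by_index[row(by_index)[is.na(by_index)], 39]

all.equal(by_apply, by_index)
```

Note that `apply` is itself a loop under the hood, so this is closer to the one-loop version than to the fully vectorized ones.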
Benchmarks.
After the comment by @Gregor I decided to benchmark the four solutions. As expected, each optimization gives a significant speed-up, and his fully vectorized solution is the fastest.
f <- function(df.i){
  for (j in 1:ncol(df.i)) {
    for (i in 1:nrow(df.i)) {
      df.i[i,j] <- ifelse(is.na(df.i[i,j]), df.i[i,39], df.i[i,j])
    }
  }
  df.i
}
f1 <- function(df.i1){
  for (j in 1:ncol(df.i1)) {
    df.i1[,j] <- ifelse(is.na(df.i1[, j]), df.i1[, 39], df.i1[, j])
  }
  df.i1
}
f2 <- function(df.i2){
  df.i2 <- ifelse(is.na(df.i2), df.i2[, 39], df.i2)
  df.i2
}
f3 <- function(df.i3){
  df.i3[is.na(df.i3)] <- df.i3[row(df.i3)[is.na(df.i3)], 39]
  df.i3
}
microbenchmark::microbenchmark(
  two_loops = f(df.i),
  one_loop = f1(df.i1),
  ifelse = f2(df.i2),
  vectorized = f3(df.i3)
)
#Unit: microseconds
#       expr      min        lq       mean    median       uq      max neval
#  two_loops 1125.017 1143.4995 1226.93089 1152.5665 1190.599 5209.431   100
#   one_loop  492.945  500.7045  518.73060  504.9435  516.638  678.951   100
#     ifelse   42.269   45.7770   50.55519   48.4140   50.470  198.533   100
# vectorized   12.626   14.5520   16.21975   15.6380   17.663   27.525   100
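One caveat when reproducing these timings: if the snippets above are run in order, df.i, df.i1, df.i2 and df.i3 have already had most of their NA values filled in by the time the benchmark is called. A rough sketch of timing on an untouched copy instead, using only base R (`raw`, `fill_ifelse` and `fill_index` are names invented for this example, and absolute timings will vary by machine):

```r
# Fresh data that is never overwritten, so every run sees the same NA cells.
set.seed(5345)
raw <- matrix(1:400, ncol = 40)
is.na(raw) <- sample(400, 50)

# The two vectorized variants from the answer, under illustrative names.
fill_ifelse <- function(m) ifelse(is.na(m), m[, 39], m)
fill_index <- function(m) {
  m[is.na(m)] <- m[row(m)[is.na(m)], 39]
  m
}

# Functions receive a copy, so `raw` itself stays unchanged between runs.
t_ifelse <- system.time(for (k in 1:1000) fill_ifelse(raw))
t_index  <- system.time(for (k in 1:1000) fill_index(raw))

# Elapsed seconds for 1000 repetitions of each variant.
c(ifelse = t_ifelse[["elapsed"]], vectorized = t_index[["elapsed"]])
```

Passing the same untouched matrix to every function keeps the comparison fair, since each variant then does the same amount of replacement work.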
Upvotes: 2