Reputation: 2763

Using nested apply functions instead of nested for loops

My objective here was to iterate over each column of a data frame and, within each column, down each row, applying a function to every cell. The specific function in this case replaces NA values with the corresponding value in the final column, but the details of the function are not relevant to the question. I got the results I needed using two nested for loops like this:

for (j in 1:ncol(df.i)) {
  for (i in 1:nrow(df.i)) {
    df.i[i,j] <- ifelse(is.na(df.i[i,j]), df.i[i,39], df.i[i,j])
  }
}

However, I believe this should be possible using an apply(df.i, 1, function) nested within an apply(df.i, 2, function), but I'm not sure whether that is possible or how to do it. Does anyone know how to achieve the same thing with a nested use of the apply function?

Upvotes: 0

Answers (1)

Rui Barradas

Reputation: 76651

Here are four ways to do the replacement, counting the double loop from the question.

First, a dataset example.

set.seed(5345)    # Make the results reproducible
df.i <- matrix(1:400, ncol = 40)
is.na(df.i) <- sample(400, 50)

Now, following the comment by @Dave2e: keep just one for loop and vectorize the innermost one.

df.i2 <- df.i3 <- df.i1 <- df.i    # Work with copies

for (j in 1:ncol(df.i1)) {
  df.i1[,j] <- ifelse(is.na(df.i1[, j]), df.i1[, 39], df.i1[, j])
}
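
To answer the literal question, the same column-wise step can also be written with apply(). The following is only a minimal sketch: it assumes a matrix like the example above, and the names m, fill and filled are illustrative, not part of the original code. A nested apply() is not needed here, because the work inside each column is already vectorized.

```r
# Rebuild the example matrix so this sketch runs on its own
set.seed(5345)
m <- matrix(1:400, ncol = 40)
is.na(m) <- sample(400, 50)

fill <- m[, 39]    # replacement values, one per row

# MARGIN = 2 hands each column to the function as a plain vector
filled <- apply(m, 2, function(col) {
  na_pos <- is.na(col)
  col[na_pos] <- fill[na_pos]    # take the value from the same row of column 39
  col
})

dim(filled)    # same shape as m
```

Note that apply() is itself a loop internally, so this is closer to the one-loop version than to the fully vectorized ones below. It also converts a data frame to a matrix before doing anything else.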

Then, vectorized, no loops at all.

df.i2 <- ifelse(is.na(df.i), df.i[, 39], df.i)

Another vectorized version, suggested by @Gregor in the comments. It is much better, since ifelse is known to be relatively slow.

df.i3[is.na(df.i3)] <- df.i3[row(df.i3)[is.na(df.i3)], 39]
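
The indexing in that line is compact, so here is a small worked sketch of what each piece does. It uses a made-up 3 x 3 matrix in which column 3 plays the role of column 39; the names m, na_pos and na_rows are illustrative only.

```r
# Small matrix, filled column by column; column 3 holds the fill values
m <- matrix(c(1, NA, 3,
              NA, 5, 6,
              7,  8, 9), nrow = 3)

na_pos  <- is.na(m)          # logical matrix marking the missing cells
na_rows <- row(m)[na_pos]    # row number of each missing cell, in column-major order

# For every missing cell, take the value from the same row of column 3
m[na_pos] <- m[na_rows, 3]
m
```

Because both sides of the assignment walk the missing cells in the same column-major order, each NA receives the value from its own row.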

And your solution, as posted in the question.

for (j in 1:ncol(df.i)) {
  for (i in 1:nrow(df.i)) {
    df.i[i,j] <- ifelse(is.na(df.i[i,j]), df.i[i,39], df.i[i,j])
  }
}

Compare the results.

identical(df.i, df.i1)
#[1] TRUE

identical(df.i, df.i2)
#[1] TRUE

identical(df.i, df.i3)
#[1] TRUE
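
The example data here is a matrix, while the question mentions a df. If df.i is really a data frame, one way to keep it a data frame is to go column by column with lapply(). This is a sketch under that assumption; the data frame df and its column names are made up for illustration.

```r
# Hypothetical data frame with the fill values in the last column
df <- data.frame(a    = c(1, NA, 3),
                 b    = c(NA, 5, 6),
                 fill = c(7, 8, 9))

fill <- df$fill

# Assigning into df[] replaces every column but keeps the data.frame structure
df[] <- lapply(df, function(col) {
  na_pos <- is.na(col)
  col[na_pos] <- fill[na_pos]    # same-row value from the fill column
  col
})
df
```

Working on one column at a time also avoids coercing columns of different types to a common type, which a matrix-based approach would do.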

Benchmarks.

After the comment by @Gregor I decided to benchmark the 4 solutions. As expected, each optimization gives a significant speed-up, and his fully vectorized solution is the fastest.

f <- function(df.i){
  for (j in 1:ncol(df.i)) {
    for (i in 1:nrow(df.i)) {
      df.i[i,j] <- ifelse(is.na(df.i[i,j]), df.i[i,39], df.i[i,j])
    }
  }
  df.i
}

f1 <- function(df.i1){
  for (j in 1:ncol(df.i1)) {
    df.i1[,j] <- ifelse(is.na(df.i1[, j]), df.i1[, 39], df.i1[, j])
  }
  df.i1
}

f2 <- function(df.i2){
  df.i2 <- ifelse(is.na(df.i2), df.i2[, 39], df.i2)
  df.i2
}

f3 <- function(df.i3){
  df.i3[is.na(df.i3)] <- df.i3[row(df.i3)[is.na(df.i3)], 39]
  df.i3
}

microbenchmark::microbenchmark(
  two_loops = f(df.i),
  one_loop = f1(df.i1),
  ifelse = f2(df.i2),
  vectorized = f3(df.i3)
)
#Unit: microseconds
#      expr      min        lq       mean    median       uq      max neval
# two_loops 1125.017 1143.4995 1226.93089 1152.5665 1190.599 5209.431   100
#  one_loop  492.945  500.7045  518.73060  504.9435  516.638  678.951   100
#    ifelse   42.269   45.7770   50.55519   48.4140   50.470  198.533   100
#vectorized   12.626   14.5520   16.21975   15.6380   17.663   27.525   100

Upvotes: 2
