How to loop through columns of a data.frame and use a function

Question

This has probably been answered already, and in that case I am sorry to repeat the question, but unfortunately I couldn't find an answer to my problem. I am currently trying to improve the readability of my code by using functions more frequently, yet I am not that familiar with them.

I have a data.frame in which some columns contain NAs that I want to interpolate, in this case with a simple Kalman filter.

require(imputeTS)

#some test data
col <- c("Temp","Prec")
df_a <- data.frame(c(10,13,NA,14,17),
                   c(20,NA,30,NA,NA))
names(df_a) <- col

#this is my function I'd like to use
gapfilling <- function(df,col){
  print(sum(is.na(df[,col])))
  df[,col] <- na_kalman(df[,col])
}

#this is my for-loop to loop through the columns
for (i in col) {
  gapfilling(df_a, i)
}

I have two problems:

My for loop works, yet it doesn't overwrite the data.frame. Why?
How can I achieve this without a for-loop? As far as I am aware you should avoid for-loops if possible and I am sure it's possible in my case, I just don't know how.

Oliver · Accepted Answer

How can I achieve this without a for-loop? As far as I am aware you should avoid for-loops if possible and I am sure it's possible in my case, I just don't know how.

You definitely do not have to avoid for loops. What you should avoid is using a loop for operations that could be vectorized. Loops in R are generally fine; they are (much) slower than loops in compiled languages such as C++, but comparable to loops in languages such as Python.
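
As a rough illustration of what "could be vectorized" means, the sketch below counts the NAs per column of the test data twice: once with an explicit loop over every cell and once with base R's vectorized is.na() and colSums(). The names na_loop and na_vec are only chosen for this example.

```r
# Test data from the question
df_a <- data.frame(Temp = c(10, 13, NA, 14, 17),
                   Prec = c(20, NA, 30, NA, NA))

# Loop version: visit every cell and count the missing ones by hand
na_loop <- c(Temp = 0, Prec = 0)
for (j in names(df_a)) {
  for (i in seq_len(nrow(df_a))) {
    if (is.na(df_a[i, j])) na_loop[j] <- na_loop[j] + 1
  }
}

# Vectorized version: is.na() and colSums() operate on the whole table at once
na_vec <- colSums(is.na(df_a))

print(na_loop)
print(na_vec)
```

Both give the same counts; the vectorized form is shorter and pushes the iteration into compiled code.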

My for loop works, yet it doesn't overwrite the data.frame. Why?

This is a matter of scope: an assignment made inside a function only affects that function's own environment (or scope). Take the example below:

f <- function(x){
    a <- x
    cat("a is equal to ", a, "\n")
    return(3)
}
x <- 4
f(x)
a is equal to  4 
[1] 3
print(a)

Error in print(a) : object 'a' not found

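As you can see, "a" definitely exists while the function runs, but it stops existing once the function call has finished. It is restricted to the environment (or scope) of the function, which only lives for the duration of the call. The same thing happens in your gapfilling function: df inside the function is a local copy, so assigning to df[,col] never touches df_a.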
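
The same effect can be seen with a data.frame. This is a minimal sketch using a hypothetical helper, set_zero, that is not part of the question's code: it replaces the NAs inside the function, yet the data.frame in the calling environment is unchanged afterwards.

```r
df_a <- data.frame(Temp = c(10, 13, NA, 14, 17))

# Hypothetical helper: replaces NA with 0, but only in the function's local copy
set_zero <- function(df, col) {
  df[is.na(df[, col]), col] <- 0   # modifies the local copy of df
  sum(is.na(df[, col]))            # number of NAs left in the local copy
}

remaining_inside  <- set_zero(df_a, "Temp")   # NAs left inside the function
remaining_outside <- sum(is.na(df_a$Temp))    # NAs left in the caller's df_a

print(remaining_inside)    # 0
print(remaining_outside)   # 1
```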

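To fix this, assign the function's result back to the data.frame in the calling (here, global) environment. This works because gapfilling returns the value of its last expression, which is the filled column: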

for (i in col) {
  df_a[, i] <- gapfilling(df_a, i)
}
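
An alternative is to let the function return the whole modified data.frame explicitly and reassign that. The sketch below is one possible variant, not the original function: gapfilling_df is a hypothetical name, and fill_mean is a simple mean-fill stand-in for na_kalman so that the example runs with base R only.

```r
# Stand-in for imputeTS::na_kalman (assumption, base R only): fill NAs with the mean
fill_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Hypothetical variant of gapfilling that returns the modified data.frame
gapfilling_df <- function(df, col, fill = fill_mean) {
  print(sum(is.na(df[, col])))   # report the number of NAs in this column
  df[, col] <- fill(df[, col])   # fill the gaps in the local copy
  df                             # return the modified copy explicitly
}

col <- c("Temp", "Prec")
df_a <- data.frame(Temp = c(10, 13, NA, 14, 17),
                   Prec = c(20, NA, 30, NA, NA))

for (i in col) {
  df_a <- gapfilling_df(df_a, i)   # overwrite df_a with the returned copy
}
print(df_a)
```

Passing fill = na_kalman instead of the stand-in should give the Kalman-based version, assuming imputeTS is loaded.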

Now, for readability (not speed), one could change this to an lapply call:

df_a[, col] <- lapply(df_a[, col], na_kalman)

I want to stress that this is not faster than using a loop. lapply iterates over each column, just as you would in a loop. A speed-up would only be possible if, say, na_kalman were written to take multiple columns at once and saved time using optimized C or C++ code.
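
To check that the loop and the lapply form do the same work, here is a small comparison. It again uses a mean-fill stand-in (fill_mean, an assumption of this sketch) instead of na_kalman so that it runs without imputeTS. It also indexes with df[col] rather than df[, col]: with a single column name, df[, col] drops to a plain vector and lapply would then iterate over the individual values instead of over columns.

```r
# Stand-in for imputeTS::na_kalman (assumption, base R only): fill NAs with the mean
fill_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

col <- c("Temp", "Prec")
df_a <- data.frame(Temp = c(10, 13, NA, 14, 17),
                   Prec = c(20, NA, 30, NA, NA))

# Loop version: one column at a time
df_loop <- df_a
for (i in col) {
  df_loop[, i] <- fill_mean(df_loop[, i])
}

# lapply version: df_apply[col] stays a data.frame even if col has length one
df_apply <- df_a
df_apply[col] <- lapply(df_apply[col], fill_mean)

same_result <- isTRUE(all.equal(df_loop, df_apply))
print(same_result)
```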
