Matthew

Reputation: 2677

Looping over rows in a dataframe

Suppose I need to loop over the rows of a data frame for some reason.

I create a simple data.frame:

df <- data.frame(id = sample(1e6, 1e7, replace = TRUE))

f2 turns out to be much slower than f1, although I expected them to be equivalent.

f1 <- function(v) {
    for (obs in 1:1e6) {
        a <- v[obs]
    }
    a
}
system.time(f1(df$id))

f2 <- function() {
    for (obs in 1:1e6) {
        a <- df$id[obs]
    }
    a
}
system.time(f2())

Would you know why? Do they use exactly the same amount of memory?

Upvotes: 8

Views: 500

Answers (2)

Josh O'Brien

Reputation: 162341

If you write your timings like this instead, and recognize that df$x is really a function call (to `$`(df, x)), the mystery disappears:

system.time(for(i in 1:1e6) df$x)
#    user  system elapsed 
#    8.52    0.00    8.53 
system.time(for(i in 1) df$x)
#    user  system elapsed 
#       0       0       0 
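To make the "function call" point concrete, here is a minimal sketch. It assumes a tiny stand-in data frame (not the 1e7-row one from the question) and only checks that the `$` syntax, the explicit call to `$`, and `[[` all return the same column:

```r
# Tiny stand-in for the question's data frame (an assumption, to keep this quick).
df <- data.frame(id = c(3L, 1L, 2L))

# df$id is shorthand for calling the function `$` with df and the column name.
via_sugar    <- df$id
via_call     <- `$`(df, id)
via_brackets <- df[["id"]]

# All three routes should yield the same vector.
same_as_call     <- identical(via_sugar, via_call)
same_as_brackets <- identical(via_sugar, via_brackets)
print(c(same_as_call, same_as_brackets))
```

Every `df$id` inside a loop body is therefore one more function call, which is where the time goes when it is repeated a million times.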

Upvotes: 7

Gregor Thomas

Reputation: 145805

In f1, you bypass the data frame entirely by just passing a vector to your function. So your code is essentially "I have a vector! This is the first element. This is the second element. This is the third..."

By contrast, in f2, you work with the whole data frame and extract the ID column again on every iteration before picking out a single element. So your code is "I have a data frame. This is the first element of the ID column. This is the second element of the ID column. This is the third..."

It's much faster to extract the simple data structure (the vector) once and then work only with that, rather than repeatedly extracting it from the larger object.
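As a rough sketch of that advice, the two loops below differ only in where the column is extracted. The function names and the reduced sizes are my own assumptions for illustration; actual timings will depend on your machine and R version:

```r
# Smaller data than in the question, so the sketch runs quickly (an assumption).
set.seed(1)
df <- data.frame(id = sample(1e3, 1e4, replace = TRUE))
n  <- nrow(df)

# Pattern of f2: the column is extracted from the data frame on every iteration.
loop_extract_each_time <- function(d, n) {
  a <- NULL
  for (obs in seq_len(n)) a <- d$id[obs]
  a
}

# Pattern of f1: the column is extracted once, then the plain vector is indexed.
loop_extract_once <- function(d, n) {
  ids <- d$id
  a <- NULL
  for (obs in seq_len(n)) a <- ids[obs]
  a
}

res_each <- loop_extract_each_time(df, n)
res_once <- loop_extract_once(df, n)

# Both versions compute the same value; only the cost per iteration differs.
print(identical(res_each, res_once))
print(system.time(loop_extract_each_time(df, n)))
print(system.time(loop_extract_once(df, n)))
```

If the loop has to stay inside a function that receives the data frame, pulling the column into a local variable before the loop gives the same result without the repeated extraction.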

Upvotes: 3
