ngr

Reputation: 3

How to improve speed for table with millions of rows

I am new to R and Stack Overflow and am hoping you would be able to help me with a question I have. I wrote the following piece of code:

my = matrix(c(1,1,1,1,1,1,1,1,1,1,
              2,2,2,2,2,2,2,
              0,1,2,3,5,6,7,10,11,14,
              0,1,2,3,4,6,10), ncol = 2, nrow = 17)
colnames(my) = c("ID", "AGE")
my = as.data.frame(my)
my$new = my$ID
system.time(for (i in 1:length(my$ID)) {
    ifelse(my$ID[i]==my$ID[i-1],
           ifelse(my$AGE[i]-my$AGE[i-1]==1, my$new[i]<-my$new[i-1], my$new[i]<-my$new[i-1]+0.1),
           my$new[i]<-my$ID[i])
})

It looks at ID and AGE, and whenever AGE is not equal to the previous AGE + 1 within the same ID, it adds another 0.1 to the running value and stores it in the column 'new'. Here is the output:

   ID AGE new
1   1   0 1.0
2   1   1 1.0
3   1   2 1.0
4   1   3 1.0
5   1   5 1.1
6   1   6 1.1
7   1   7 1.1
8   1  10 1.2
9   1  11 1.2
10  1  14 1.3
11  2   0 2.0
12  2   1 2.0
13  2   2 2.0
14  2   3 2.0
15  2   4 2.0
16  2   6 2.1
17  2  10 2.2

The problem is that it is really quick on datasets of, say, 1,000 rows, but on my actual dataset, which has more than 8.5 million rows, it feels like it will never finish; I waited several hours without success.

I would greatly appreciate it if you could suggest a way to improve the speed/efficiency of this code.

Upvotes: 0

Views: 76

Answers (1)

Bastien

Reputation: 166

When dealing with large datasets, you need to vectorize your calculations to improve speed.
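As a rough illustration of the difference (a toy example, not your actual calculation), compare looping over elements with a single vectorized call:

x   <- runif(1e6)
out <- numeric(length(x))
system.time(for (i in seq_along(x)) out[i] <- x[i] * 2) # one R call per element
system.time(out2 <- x * 2)                              # one vectorized call in compiled code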

Here is an example using the dplyr package (its lag function is needed to look up the previous value of AGE):

require(dplyr)          # for group_by, mutate, lag and %>%
require(microbenchmark) # to compare the two approaches

# Your data
my           <- matrix(c(1,1,1,1,1,1,1,1,1,1,
                         2,2,2,2,2,2,2,
                         0,1,2,3,5,6,7,10,11,14,
                         0,1,2,3,4,6,10), ncol=2, nrow=17)
colnames(my) <- c("ID", "AGE")
my           <- as.data.frame(my)
my$new       <- my$ID
my2          <- my[rep(1:nrow(my), times=100),] # larger dataset

# Your function
f1 <- function(my) {
    for (i in 1:length(my$ID)) {
        ifelse(my$ID[i]==my$ID[i-1],
               ifelse(my$AGE[i]-my$AGE[i-1]==1, my$new[i] <- my$new[i-1],
                                                my$new[i] <- my$new[i-1]+0.1),
               my$new[i] <- my$ID[i])
    }
    my # return the modified data so the result can be inspected
}

# dplyr function
f2 <- function(my) {
    my %>% group_by(ID) %>%                     # work by ID
        mutate(new2=ifelse(is.na(lag(AGE)),     # lag(AGE) is NA on the first record of each ID
                           FALSE,               # so that row is never counted as a break
                           AGE!=lag(AGE)+1),    # new2=TRUE whenever AGE != previous AGE + 1
               new3=new+cumsum(new2)/10)        # add 0.1 for every break seen so far within the ID
}
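To check that the two approaches agree (on the toy data above), compare new3 from the dplyr version with new from the loop; new2 is only the intermediate break flag:

res <- f2(my)                    # grouped tibble with the extra columns new2 and new3
all.equal(res$new3, f1(my)$new)  # should be TRUE (up to floating-point rounding)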

With a small dataset the for loop is more efficient, but with a larger dataset the improvement is clear:

# Compare codes
microbenchmark(f1(my), f2(my),
               f1(my2), f2(my2))
Unit: milliseconds
    expr        min         lq       mean     median         uq       max neval cld
  f1(my)   1.470699   1.957855   2.750798   2.049954   2.243380  62.34942   100  a 
  f2(my)   2.741614   3.853356   4.235745   4.147085   4.421738  10.84871   100  a 
 f1(my2) 156.986927 215.515605 218.806729 222.390968 228.362988 290.19161   100   b
 f2(my2)   3.398812   4.377638   5.128953   4.659674   5.161190  28.97461   100  a 
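If you want to avoid the dplyr dependency, the same logic can also be written in base R (a sketch following the approach of f2, not part of the benchmark above; it assumes the data are sorted by ID and AGE, as in the question):

f3 <- function(my) {
    brk <- c(FALSE, diff(my$AGE) != 1 | diff(my$ID) != 0) # TRUE where AGE does not follow the previous row
    brk[!duplicated(my$ID)] <- FALSE                       # the first row of each ID is never a break
    my$new <- my$ID + ave(as.numeric(brk), my$ID, FUN = cumsum) / 10
    my
}
f3(my) # reproduces the 'new' column from the question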

Upvotes: 1
