Making my code more efficient in R

Question

I am trying to run code that takes far too long (more than 6 days). Is there a way to make it more efficient? Any ideas?

library(haven)
library(plyr)
AFILIAD1 <- read_sav("XXXX")
#this sav has around 6 million rows.

AFILIAD1$F_ALTA<- as.character(AFILIAD1$F_ALTA)
AFILIAD1$F_BAJA<- as.character(AFILIAD1$F_BAJA)


AFILIAD1$F_ALTA <- as.Date(AFILIAD1$F_ALTA, "%Y%m%d")
AFILIAD1$F_BAJA <- as.Date(AFILIAD1$F_BAJA, "%Y%m%d")
#starting and ending date

meses <- seq(as.Date("1900-01-01"), as.Date("2014-12-31"), by = "month")

#this is the function that needs to be more efficient 
ocupados <- function(pruebas){
  previo <- c()
  total <- c()
  for( i in 1:length(meses)){
    for( j in 1:nrow(pruebas)){
      ifelse(pruebas$F_ALTA[j] <= meses[i] & pruebas$F_BAJA[j] >= meses[i],
             previo[j] <- pruebas$IPF[j], previo[j] <- NA)
    }
    total[i] <- (length(unique(previo))-1)
  }
  names(total) <- meses
  return(total)
}

#this takes >6 days to execute
afiliado1 <- ocupados(AFILIAD1)

Melissa Key · Accepted Answer

There is a lot you can do to speed this up. Here's one example:

library(tidyverse) # provides map_int(), n_distinct() and the %>% pipe
ocupados <- function(pruebas) {
  total <- map_int(meses, function(x) {
    with(pruebas, {
      IPF[F_ALTA <= x & F_BAJA >= x] %>%
        n_distinct() #I'm assuming you subtract 1 to remove the NA effect - no longer needed
    })
  })
  names(total) <- meses
  return(total)
}

There are two big speed-ups here. First, the inner loop over rows is replaced by vectorized comparison and subsetting, which run in compiled code (so you don't see a loop here). That will be a huge saving for you.
Second, we never grow a vector one element at a time. A vector that grows may have to be copied EVERY time you increase its length, which is very expensive. Instead, all I'm saving is the final result. map_int(), like the apply family of functions, behaves like a loop, but runs the body as a function for each element and collects the results for you.
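To make the first point concrete, here is a minimal base R sketch on made-up data. The column names follow the question, but the values, the single date fecha, and the names activo_loop and activo_vec are assumptions for illustration only. It checks which rows are active on one date, first row by row and then with one vectorized comparison.

```r
# Toy data: column names mirror the question, the values are invented.
pruebas <- data.frame(
  IPF    = c(1, 1, 2, 3),
  F_ALTA = as.Date(c("2014-01-15", "2014-06-01", "2014-02-10", "2014-03-20")),
  F_BAJA = as.Date(c("2014-08-31", "2014-12-31", "2014-02-20", "2014-09-30"))
)
fecha <- as.Date("2014-07-01")  # hypothetical reference date

# Row-by-row version: one R-level iteration per row.
activo_loop <- logical(nrow(pruebas))  # preallocated, not grown
for (j in seq_len(nrow(pruebas))) {
  activo_loop[j] <- pruebas$F_ALTA[j] <= fecha && pruebas$F_BAJA[j] >= fecha
}

# Vectorized version: one comparison over each whole column.
activo_vec <- pruebas$F_ALTA <= fecha & pruebas$F_BAJA >= fecha

identical(activo_loop, activo_vec)

# Distinct IPF values among the active rows (rows 1, 2 and 4 here).
length(unique(pruebas$IPF[activo_vec]))
```

Both versions give the same logical mask. The vectorized one does the work in a single call, and that is the part that matters with 6 million rows.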

If you're not familiar with the pipe operator (%>%), all it does is call the next function with the result from the previous function as its first argument. So

length(unique(x))

is the same as

x %>%
  unique() %>%
  length()

The advantage is readability - it's easier to see that I apply unique, then length using the pipe.

One more comment - without a reproducible example, I cannot test this code. If you have trouble, you need to include a small reproducible data set so we can actually test what the code is doing.
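As a rough idea of what such a data set could look like, here is a small sketch with invented values. The function names ocupados_loop and ocupados_vec are hypothetical, meses is passed as an argument rather than read from the global environment, and the vectorized version is written in base R instead of the tidyverse so it runs without extra packages. It compares the logic of the original double loop with a vectorized version.

```r
# Toy data: column names follow the question, values are invented.
pruebas <- data.frame(
  IPF    = c(101, 101, 202, 303),
  F_ALTA = as.Date(c("2013-12-15", "2014-01-10", "2014-01-20", "2014-02-05")),
  F_BAJA = as.Date(c("2014-02-28", "2014-06-30", "2014-03-15", "2014-02-20"))
)
meses <- seq(as.Date("2014-01-01"), as.Date("2014-04-01"), by = "month")

# Same logic as the original double loop, with meses passed in explicitly.
ocupados_loop <- function(pruebas, meses) {
  previo <- c()
  total <- c()
  for (i in seq_along(meses)) {
    for (j in seq_len(nrow(pruebas))) {
      if (pruebas$F_ALTA[j] <= meses[i] && pruebas$F_BAJA[j] >= meses[i]) {
        previo[j] <- pruebas$IPF[j]
      } else {
        previo[j] <- NA
      }
    }
    total[i] <- length(unique(previo)) - 1  # the - 1 drops the NA entry
  }
  names(total) <- as.character(meses)
  total
}

# Vectorized version: one logical mask per month, no inner loop.
ocupados_vec <- function(pruebas, meses) {
  total <- vapply(seq_along(meses), function(i) {
    activo <- pruebas$F_ALTA <= meses[i] & pruebas$F_BAJA >= meses[i]
    length(unique(pruebas$IPF[activo]))
  }, integer(1))
  names(total) <- as.character(meses)
  total
}

all(ocupados_loop(pruebas, meses) == ocupados_vec(pruebas, meses))
```

On this toy data both functions return 1, 2, 2, 1 for the four months. Note that the loop version's "- 1" is only correct when at least one row is inactive in every month. The last row, which never covers the first day of a month, guarantees that here.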
