Optimizing a slow for-loop operation

Question

I am trying to make a column containing full texts by iterating over elements in few columns of the data.table. here is my current approach. It is working as I intended however, when the data.table gets big, I get immense time waste.

library(data.table)

new_df <- data.table(text= c("RT A y...", "RT b...", "XYZ 3...", "RT Ca...", "IO"),
                     full_text= c(NA, NA, "XYZ 378978978", NA, NA),
                     status.text= c("A yes y...", "ball ball", NA, "Call ca...", NA),
                     status.full_text= c("A yes yes yes yes", NA, NA, "Call call call", NA))

#     text     full_text status.text  status.full_text
# 1: RT A y...            A yes y... A yes yes yes yes
# 2:   RT b...             ball ball              
# 3:  XYZ 3... XYZ 378978978                      
# 4:  RT Ca...            Call ca...    Call call call
# 5:        IO                                
#   

attach_texts_in_df <- function(give_me_df){
  
  #make an empty vector to store texts
  complete_texts <- c()
  
  #loop through each elements of rows
  for(i in seq_along(1:nrow(give_me_df))){
    
    #check if text begins with RT
    if(!grepl('^RT', give_me_df[i, "text"])){
      #check if text is smaller than the full_text, while full text is not NA
      if((nchar(give_me_df[i, "text"]) < nchar(give_me_df[i, "full_text"]))& !is.na(give_me_df[i, "full_text"])){
        complete_texts <- c(complete_texts, give_me_df[i, "full_text"])
        
      }else{
        complete_texts <- c(complete_texts, give_me_df[i, "text"]) # if not, then it's original
      }
      
    }
    else{
      
      if((nchar(give_me_df[i, "status.text"]) < nchar(give_me_df[i, "status.full_text"]))& !is.na(give_me_df[i, "status.full_text"])){
        complete_texts <- c(complete_texts, give_me_df[i, "status.full_text"])
        
      }else{
        complete_texts <- c(complete_texts, give_me_df[i, "status.text"])
      }
      
    }
  }
  
  #attached the proper texts
  give_me_df$complete_text <- complete_texts
  
  #return the vector
  return(give_me_df)
}

new_df <- attach_texts_in_df(new_df)

#this was the what I was looking for and I got it when its small, but big one take a long time!!
#     text     full_text status.text  status.full_text     complete_text
# 1: RT A y...            A yes y... A yes yes yes yes A yes yes yes yes
# 2:   RT b...             ball ball                       ball ball
# 3:  XYZ 3... XYZ 378978978                           XYZ 378978978
# 4:  RT Ca...            Call ca...    Call call call    Call call call
# 5:        IO                                                IO

I was wondering if someone can help me optimize this. R is new for me. I know apply functions exist, but I don't know how to work with those with custom functions like these.

I will appreciate your assistance and hint. Thank you.

EDIT: I did the following using data.table function, however I am missing some data:

sample_fxn <-  function(t,ft,st,sft){
  if(!grepl('^RT', t)){
    if((nchar(t) < nchar(ft)) & !is.na(ft)){
      return(ft)
    }else{
      return(t)
    }
  }
  else{
    if((nchar(st) < nchar(sft))& !is.na(sft)){
      return(sft)
    }else{
      return(st)
    }
  }
}

new_df <- new_df[ ,complete_texts := sample_fxn(text,
                                                full_text,
                                                status.text,
                                                status.full_text)]

#   text     full_text status.text  status.full_text         complete_texts
# 1: RT A y...            A yes y... A yes yes yes yes  A yes yes yes yes 
# 2:   RT b...             ball ball                                            
# 3:  XYZ 3... XYZ 378978978                                                   
# 4:  RT Ca...            Call ca...    Call call call    Call call call    
# 5:        IO

This is my best attempt after reading the vectorized version from that R Inferno book which @Henrik shared. I came up with:

new_df$complete_texts <- ifelse(!grepl('^RT', new_df$text),
                                yes = ifelse((nchar(new_df$text) < nchar(new_df$full_text))& !is.na(new_df$full_text),
                                             yes = new_df$full_text,
                                             no = new_df$text
                                ),
                                no = ifelse((nchar(new_df$status.text) < nchar(new_df$status.full_text))& !is.na(new_df$status.full_text),
                                            yes = new_df$status.full_text,
                                            no = new_df$status.text
                                )
                          )

This does gets the job done 3X faster. I was wondering if somebody can explain me better ways. I want to learn.

MichaelChirico · Accepted Answer

Make sure to read some intro material for data.table -- in particular Introduction and Reference Semantics vignettes.

Next, the most glaring thing I see is a lack of vectorization. In low-level languages, you have to loop all the time; in R, you should always think -- do I really need a loop here? In your code, I see a scalar version of several vectorized functions being used:

grepl works on vectors
nchar works on vectors
is.na works on vectors

Whenever possible, you should use the vector version -- there is some latency of calling a C function repeatedly compared to just calling it once:

for (i in 1:nrow(DT)) grepl('^RT', DT[i, "text"]) is guaranteed slower than grepl('^RT', DT$text)

Next, data.table has some further overhead in calling [ repeatedly, because there is a lot going on internally inside [ to handle more complex "queries", so you should try and take advantage of that when possible!

Lastly, rather than updating the data.table in your function, I would rather have the function return a vector that I can assign as a column -- the end goal is to have something like:

new_df[ , complete_text := my_function(.SD)]

Note that my_function(.SD) is the same as my_function(new_df) in this simple case -- the use of .SD here is to get used to this syntax in more complex scenarios; see the .SD vignette for more.

Here's my updated version of what I'll call get_complete_text:

get_complete_text = function(DT) {
  DT[ , fifelse(
    grepl('^RT', text),
    fifelse(
      nchar(status.text) < nchar(status.full_text) & !is.na(status.full_text), 
      status.full_text,
      status.text
    ),
    fifelse(
      nchar(text) < nchar(full_text) & !is.na(full_text),
      full_text,
      text
    )
  )]
}
new_df[ , complete_text := get_complete_text(.SD)][]
#         text     full_text status.text  status.full_text     complete_text
# 1: RT A y...            A yes y... A yes yes yes yes A yes yes yes yes
# 2:   RT b...             ball ball                       ball ball
# 3:  XYZ 3... XYZ 378978978                           XYZ 378978978
# 4:  RT Ca...            Call ca...    Call call call    Call call call
# 5:        IO                                                IO

The nested fifelse could be simplified with an intermediate column storing which text columns to use based on the ^RT condition:

idx = new_df[grepl('^RT', text), which=TRUE]
new_df[idx, c('rt_text', 'rt_full_text') := .(status.text, status.full_text)]
new_df[-idx, c('rt_text', 'rt_full_text') := .(text, full_text)]

new_df[ , complete_text := 
  fifelse(
    nchar(rt_text) < nchar(rt_full_text) & !is.na(rt_full_text),
    rt_full_text,
    rt_text
  )
]

Alternatively, with the development version of data.table there is fcase, which you may find more readable (in this case, I think the nested fifelse works fine):

get_complete_text = function(DT) {
  DT[ , fcase(
    grepl('^RT', text) & nchar(status.text) < nchar(status.full_text) & !is.na(status.full_text),
    status.full_text,
    grepl('^RT', text) & (nchar(status.full_text) >= nchar(status.text) | is.na(status.full_text)),
    status.text,
    # here, we're implicitly using that logically grepl('^RT') is now FALSE
    nchar(text) < nchar(full_text) & !is.na(full_text),
    full_text,
    # there is some ongoing work to make this less ugly,
    #   but for now this is the catchall term -- we could also
    #   explicitly write out the conjugate condition nchar(full_text) >= ...
    rep(TRUE, .N),
    text
  )]
}

Optimizing a slow for-loop operation

Answers (1)

Related Questions