Anthony Nash

Reputation: 1137

How to optimize iterating over a huge dataframe with non-unique rows

I understand that if R is growing an object inside a for loop, rather than updating a variable in place, then I've just written some horrendously slow and expensive code. Unfortunately, with a set of very tight deadlines and a strong background in C++/Java, it's my go-to behaviour until I can get my R hat on.

I have a function I need to improve. It takes a dataframe (as below), returns the unique patid values, and uses those to retrieve subsets of that dataframe for date modifications. A trimmed example is below (note: I pulled this out of a completed run, so the dates have already been modified). The last R run I performed was over a dataframe of 27 million rows and took about four to five hours. The size of the dataframe will be a lot bigger.

patid eventdate
1     12/03/1998   
1     12/03/1998   
2     04/03/2007   
3     15/11/1980   
3     15/11/1980   
3     01/02/1981  

A trimmed example of the function:

rearrangeDates <- function(dataFrame) {
   #return a list of the unique patient ids
   uniquePatids <- getUniquePatidList(dataFrame) #this is only called once and is very fast

    out=NULL
    for(i in 1:length(uniquePatids)) { # iterate over the list 
        idf <- subset(dataFrame, patid == uniquePatids[[i]])
        idf$eventdate <- as.POSIXct(idf$eventdate,format="%d/%m/%Y")
        idf <- idf[order(idf$eventdate,decreasing=FALSE),]
        out = rbind(out,idf)
    }
    return(out)
}
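
The rbind(out, idf) append is the expensive part: it copies the accumulated result on every iteration, so the total work grows roughly quadratically with the number of patients. For comparison, the whole loop is equivalent to one vectorized convert-and-sort (a base R sketch, assuming eventdate holds dd/mm/yyyy strings as in the sample above):

```r
# Convert the whole column once, then sort by patid and eventdate in one pass
dataFrame$eventdate <- as.POSIXct(dataFrame$eventdate, format = "%d/%m/%Y")
out <- dataFrame[order(dataFrame$patid, dataFrame$eventdate), ]
```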

Can anyone suggest improvements?

Upvotes: 0

Views: 82

Answers (2)

smci

Reputation: 33960

This is ideally suited to data.table: your data has a well-defined key to group by (patid, eventdate), and you know the output df will be no larger than the input df, so it's safe to do in-place assignments (waaay faster) instead of appends; you don't need the iterative rbind at all. data.table also has a nice fast unique function. So please try out the (loop-free!) code below and let us know how it compares both to your original and to the dplyr approach:

require(data.table)
dt = data.table(patid=c(1,1,2,3,3,3), eventdate=c('12/03/1998','12/03/1998',
         '04/03/2007', '15/11/1980', '15/11/1980','01/02/1981'))  
dt[, eventdate := as.POSIXct(eventdate,format="%d/%m/%Y") ]

# Setting the key sorts dt in place by patid, then eventdate
setkeyv(dt, c('patid','eventdate'))

odt <- dt  # already in the required order

   patid  eventdate
1:     1 1998-03-12
2:     1 1998-03-12
3:     2 2007-03-04
4:     3 1980-11-15
5:     3 1980-11-15
6:     3 1981-02-01

(One last thing: don't be afraid of POSIXct/POSIXlt. Convert to them early: they're more efficient than strings, and they support comparison operators, so the column can be used as a key, sorted on, and compared.)

(And for the fastest dplyr implementation, use dplyr::distinct())
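
For example (a sketch, assuming a dataframe df with a patid column, and that you only need the unique ids rather than the sorted rows):

```r
library(dplyr)
# distinct() returns one row per unique patid; pull the column out as a vector
unique_ids <- distinct(df, patid)$patid
```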

Upvotes: 1

Prem

Reputation: 11965

Since you want to sort your data on patid & eventdate, this should work.

library(dplyr)

df %>% 
  mutate(eventdate = as.Date(eventdate, format="%d/%m/%Y")) %>% 
  arrange(patid, eventdate)

Output is:

  patid  eventdate
1     1 1998-03-12
2     1 1998-03-12
3     2 2007-03-04
4     3 1980-11-15
5     3 1980-11-15
6     3 1981-02-01

Sample data:

df <- structure(list(patid = c(1L, 1L, 2L, 3L, 3L, 3L), eventdate = c("12/03/1998", 
"12/03/1998", "04/03/2007", "15/11/1980", "15/11/1980", "01/02/1981"
)), class = "data.frame", row.names = c(NA, -6L))

Upvotes: 2
