Reputation: 1137
I understand that if R is not updating a variable in place within the confines of a for loop then I've just made some horrendously slow and expensive code. Unfortunately, with a set of very tight deadlines and a strong background in C++/Java it's my go-to behaviour until I can get my R hat on.
I have a function I need to improve. It takes a dataframe (as below) returns the unique patid values and uses those to retrieve subsets of that dataframe for date modifications. A trimmed example below (note, I just pulled this out of a completed run, so the date has already been modified). The last R run I performed was over a dataframe of 27 million row and took about four/five hours. The size of the dataframe will be a lot bigger.
patid eventdate
1 12/03/1998
1 12/03/1998
2 04/03/2007
3 15/11/1980
3 15/11/1980
3 01/02/1981
A trimmed example of the function:
rearrangeDates <- function(dataFrame) {
#return a list of the unique patient ids
uniquePatids <- getUniquePatidList(dataFrame) #this is only called once and is very fast
out=NULL
for(i in 1:length(uniquePatids)) { # iterate over the list
idf <- subset(dataFrame, dataFrame$patid=uniquePatids[[i]])
idf$eventdate <- as.POSIXct(idf$eventdate,format="%d/%m/%Y")
idf <- idf[order(idf$eventdate,decreasing=FALSE),]
out = rbind(out,idf)
}
return(out)
}
Can anyone suggest improvements?
Upvotes: 0
Views: 82
Reputation: 33960
This is ideally suited to data.table
: your data has a well-defined key that you group-by (patid,eventdate)
, you know the size of the output df will be <= size of input df, so it's safe to do do in-place assignments (waaay faster) instead of appends, you don't need the output iterative-append, and data.table has a nice fast unique
function. So please try out the (loop-free!) code below and let us know how it compares both to your original, and to the dplyr
approach:
require(data.table)
dt = data.table(patid=c(1,1,2,3,3,3), eventdate=c('12/03/1998','12/03/1998',
'04/03/2007', '15/11/1980', '15/11/1980','01/02/1981'))
dt[, eventdate := as.POSIXct(eventdate,format="%d/%m/%Y") ]
# If you set a key, the `by` operation will be super-fast
setkeyv(dt, c('patid','eventdate'))
odt <- dt[, by=.(patid,eventdate)]
patid eventdate
1: 1 1998-03-12
2: 1 1998-03-12
3: 2 2007-03-04
4: 3 1980-11-15
5: 3 1980-11-15
6: 3 1981-02-01
(One last thing: don't be afraid of POSIXct/lt, convert to them early, they're more efficient than strings, they support comparison operators hence the column can be used as key, sorted on, compared.)
(And for the fastest dplyr
implementation, use dplyr::distinct()
)
Upvotes: 1
Reputation: 11965
Since you want to sort your data on patid
& eventdate
this should work.
library(dplyr)
df %>%
mutate(eventdate = as.Date(eventdate, format="%d/%m/%Y")) %>%
arrange(patid, eventdate)
Output is:
patid eventdate
1 1 1998-03-12
2 1 1998-03-12
3 2 2007-03-04
4 3 1980-11-15
5 3 1980-11-15
6 3 1981-02-01
Sample data:
df <- structure(list(patid = c(1L, 1L, 2L, 3L, 3L, 3L), eventdate = c("12/03/1998",
"12/03/1998", "04/03/2007", "15/11/1980", "15/11/1980", "01/02/1981"
)), class = "data.frame", row.names = c(NA, -6L))
Upvotes: 2