Pre-allocation of memory in custom R function to improve performance (using dplyr)

EDIT: as I am not at all familiar with data.table, does anyone have any ideas for other solutions besides switching to data.table? Thanks a ton!

I have quite a big data set, which contains the startdates and enddates of different types of incidents (every row contains an incident with the respective startdate and enddate). Now I would like to know if there was an incident of the same type directly before or after the current incident. The tricky thing is, that holidays and weekends in between the incidents do not count/should not be considered.

Example: Incident of type 1 starts on Wednesday, ends on Friday, then there is the weekend and on Monday another type 1 incident starts and lasts until Friday. In this case, the "incident_directly_before" would be true (=1) for the second one, because the two incidents are only separated by a weekend, which should not be considered, and false (=0) for the first one, because it was the first of its kind.

I have written a function for this, but it is quite slow.

My question now is: do you have any idea how to improve the performance of the code?

I have read about pre-allocation of memory, but as I do not have any "for(i in 1:n)" I am not sure how to do so.

I have also tried cmpfun() from the compiler package, but it performed more or less the same (even slightly worse) than the original.

As I do not have a CS background and just to dig into the topic of code optimization, I would really be happy about some help and also explanation why certain approaches (do not) work in my case.

Packages:

  library(dplyr)
  library(lubridate)

Example data:

df <- structure(list(start = structure(c(16920, 16961, 16988, 17008, 13563, 13598, 13819, 13880, 13886, 
                                                 13887, 13892, 13899, 13907, 13910, 13969, 14487, 14488, 14550, 
                                                 14606, 14676, 14743, 14819, 14841, 14851, 14915, 14984), class = "Date"), 
                     end = structure(c(16927,16965, 16990, 17011, 13595, 13616, 13875, 13885, 13886, 13889, 
                                               13896, 13906, 13909, 13966, 13969, 14487, 14496, 14554, 14608, 
                                               14680, 14743, 14820, 14841, 14862, 14918, 14985), class = "Date"), 
                     type = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 4, 4, 4, 5, 6, 7, 8, 8, 9, 9, 9, 9, 9, 9)), 
                class = "data.frame", row.names = c(NA, -26L))

Example of my custom holidays vector:

holidays <- as.Date(c("2009-12-30", "2009-12-31", "2010-01-01"))

My function to check if there was an incident of the same type just before (weekend and holidays excluded):

incident_function <- function(startdate, enddate, lagstart, lagend) {
  if (is.na(lagstart) ||is.na(lagend) ) {
    priorincident <- 0
  } else {
    daycount <- 0
    priorincident <- 0
    day_start <- as.Date(startdate) - lubridate::duration(1, 'days')
    while (day_start %in% holidays || weekdays(day_start) %in% c("Saturday", "Sunday")) { 
      daycount <- daycount +1
      day_start <- (as.Date(day_start) - lubridate::duration(1, 'days'))
    } 
    { if (as.Date(day_start) %in% seq.Date(lagstart, lagend, by='days')){
      priorincident <- 1
    } else {
      priorincident <- 0
    } 
    }
    return(priorincident) 
  }
}

The function basically does the following: 1) if it is the first incident of the type/lag incident is NA, assign a 0 to priorincident (=there was no prior incident of the same type). 2) else: take the startdate of the current row and see if one day before was a holiday or Saturday/Sunday; if yes, go back one day further and check again (...). If then the startdate minus n-days is neither a holiday, nor a Saturday/Sunday, nor the enddate of the lag-incident, assign 0 to priorincident, however, if startdate minus n-days is the enddate of the prior incident, assign 1 to priorincident (=there was a prior incident of the same type).

(The "same-type" aspect is covered due to the group_by(type) in the dplyr pipe)

Then I used dplyr to group by incident-type and then apply the incident_function:

df %>%
  group_by(type) %>%
  dplyr::mutate(incident_directly_before = mapply(incident_function, startdate=start, enddate=end, lagstart=dplyr::lag(start), lagend=dplyr::lag(end))) -> df

   start      end         type incident_directly_before
   <date>     <date>     <dbl>                    <dbl>
 1 2016-04-29 2016-05-06     1                        0
 2 2016-06-09 2016-06-13     1                        0
 3 2016-07-06 2016-07-08     1                        0
 4 2016-07-26 2016-07-29     1                        0
 5 2007-02-19 2007-03-23     2                        0
 6 2007-03-26 2007-04-13     2                        1
 7 2007-11-02 2007-12-28     2                        0
 8 2008-01-02 2008-01-07     2                        0
 9 2008-01-08 2008-01-08     2                        1
10 2008-01-09 2008-01-11     2                        1
11 2008-01-14 2008-01-18     2                        1
12 2008-01-21 2008-01-28     3                        0
13 2008-01-29 2008-01-31     4                        0
14 2008-02-01 2008-03-28     4                        1
15 2008-03-31 2008-03-31     4                        1
16 2009-08-31 2009-08-31     5                        0
17 2009-09-01 2009-09-09     6                        0
18 2009-11-02 2009-11-06     7                        0
19 2009-12-28 2009-12-30     8                        0
20 2010-03-08 2010-03-12     8                        0
21 2010-05-14 2010-05-14     9                        0
22 2010-07-29 2010-07-30     9                        0
23 2010-08-20 2010-08-20     9                        0
24 2010-08-30 2010-09-10     9                        0
25 2010-11-02 2010-11-05     9                        0
26 2011-01-10 2011-01-11     9                        0

Thanks so much in advance for not letting me waste my life staring at that sweet little red octagon!

Upvotes: 2

Answers (2)

Wimpel

Reputation: 27732

another data.table approach, which takes Saturdays and Sundays into account...

code

library(data.table)
setDT(df)

#get the day before and the day after, exclude saturdays and sundays
# use wday(start), sunday = 1, saturday = 7
# detrmine previous and next days..
# you can add holidays the same way...
df[ ,`:=`(id = seq.int(.N), prevDay = start - 1, nextDay = end + 1 )]
df[ wday(start) == 7, prevDay := start - 1 ]
df[ wday(start) == 1, prevDay := start - 2 ]
df[ wday(end) == 7, nextDay := start + 2 ]
df[ wday(end) == 1, nextDay := start + 1 ]
setcolorder(df, "id")

#perform join on self
df[df, overlap_id_after := i.id, on = .(type, nextDay == start)]
df[df, overlap_id_before := i.id, on = .(type, prevDay == start)]

sample data

df <- structure(list(start = structure(c(16920, 16961, 16988, 17008, 13563, 13598, 13819, 13880, 13886, 
                                         13887, 13892, 13899, 13907, 13910, 13969, 14487, 14488, 14550, 
                                         14606, 14676, 14743, 14819, 14841, 14851, 14915, 14984), class = "Date"), 
                     end = structure(c(16927,16965, 16990, 17011, 13595, 13616, 13875, 13885, 13886, 13889, 
                                       13896, 13906, 13909, 13966, 13969, 14487, 14496, 14554, 14608, 
                                       14680, 14743, 14820, 14841, 14862, 14918, 14985), class = "Date"), 
                     type = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 4, 4, 4, 5, 6, 7, 8, 8, 9, 9, 9, 9, 9, 9)), 
                class = "data.frame", row.names = c(NA, -26L))

Upvotes: 2

Jaccar

Reputation: 1844

Although there will be other ways to make this faster, I am a big advocate of using data.table when you need things to be speedier.

Therefore, if I just change your dataframe to a data.table, the time is more than halved:

    dt <- as.data.table(df)

    dt[, incident_directly_before := mapply(incident_function, 
                                            startdate = start, 
                                            enddate=end, 
                                            lagstart=dplyr::lag(start),
                                            lagend=dplyr::lag(end)),
       by = type]

Using your original code, this section took me 0.2451596 secs. Using data.table took me 0.1155329 secs.

This is because data.table mutates in place rather than creating a copy of the data.

Upvotes: 1

Pre-allocation of memory in custom R function to improve performance (using dplyr)

Answers (2)

Related Questions