Reputation: 1069

R: sequence of days between dates

I have the following dataframes:

AllDays  
2012-01-01  
2012-01-02  
2012-01-03  
...  
2015-08-18  

Leases 
StartDate  EndDate
2012-01-01 2013-01-01  
2012-05-07 2013-05-06  
2013-09-05 2013-12-01

What I want to do is, for each date in the allDays dataframe, calculate the number of leases that are in effect. e.g. if there are 4 leases with start date <= 2015-01-01 and end date >= 2015-01-01, then I would like to place a 4 in that dataframe.

I have the following code

  for (i in 1:nrow(leases))
  {
    occupied = seq(leases$StartDate[i],leases$EndDate[i],by="days")
    occupied = occupied[occupied < dateOfInt]
    matching = match(occupied,allDays$Date)
    allDays$Occupancy[matching] = allDays$Occupancy[matching] + 1
  }

which works, but as I have about 5000 leases, it takes about 1.1 seconds. Does anyone have a more efficient method that would require less computation time? Date of interest is just the current date and is used simply to ensure that it doesn't count lease dates in the future.

Upvotes: 2

Answers (5)

Khashaa

Reputation: 7373

This is exactly the problem where foverlaps shines: subsetting a data.frame based upon another data.frame (foverlaps seems to be tailored for that purpose).

Based on @MichaelChirico's data.

setkey(days[, AllDays1:=AllDays,], AllDays, AllDays1)
setkey(leases, StartDate, EndDate)
foverlaps(leases, days)[, .(lease_count=.N), AllDays]
#   user  system elapsed 
#  0.114   0.018   0.136
# @MichaelChirico's approach
#   user  system elapsed 
#  0.909   0.000   0.907

Here is a brief explanation on how it works by @Arun, which got me started with the data.table.

Upvotes: 5

MichaelChirico

Reputation: 34763

Using seq is almost surely inefficient--imagine you had a lease in your data that's 10000 years long. seq will take forever and return 10000*365-1 days that don't matter to us. We then have to use %in% which also makes the same number of unnecessary comparisons.

I'm not sure the following is the best approach (I'm convinced there's a fully vectorized solution) but it gets closer to the heart of the problem.

Data

set.seed(102349)
days<-data.frame(AllDays=seq(as.Date("2012-01-01"),
                             as.Date("2015-08-18"),"day"))

leases<-data.frame(StartDate=sample(days$AllDays,5000L,T))
leases$EndDate<-leases$StartDate+round(rnorm(5000,mean=365,sd=100))

Approach

Use data.table and sapply:

library(data.table)
setDT(leases); setDT(days)

days[,lease_count:=
       sapply(AllDays,function(x)
         leases[StartDate<=x&EndDate>=x,.N])][]
         AllDays lease_count
   1: 2012-01-01           5
   2: 2012-01-02           8
   3: 2012-01-03          11
   4: 2012-01-04          16
   5: 2012-01-05          18
  ---                       
1322: 2015-08-14        1358
1323: 2015-08-15        1358
1324: 2015-08-16        1360
1325: 2015-08-17        1363
1326: 2015-08-18        1359

Upvotes: 6

ulfelder

Reputation: 5335

Without your data, I can't test whether or not this is faster, but it gets the job done with less code:

for (i in 1:nrow(AllDays)) AllDays$tally[i] = sum(AllDays$AllDays[i] >= Leases$Start.Date & AllDays$AllDays[i] <= Leases$End.Date)

I used the following to test it; note that the relevant columns in both data frames are formatted as dates:

AllDays = data.frame(AllDays = seq(from=as.Date("2012-01-01"), to=as.Date("2015-08-18"), by=1))
Leases = data.frame(Start.Date = as.Date(c("2013-01-01", "2012-08-20", "2014-06-01")), End.Date = as.Date(c("2013-12-31", "2014-12-31", "2015-05-31")))

Upvotes: 2

slightlydeviant

Reputation: 44

Try the lubridate package. Create an interval for each lease. Then count the lease intervals which each date falls in.

# make some data
AllDays <- data.frame("Days" = seq.Date(as.Date("2012-01-01"), as.Date("2012-02-01"), by = 1))
Leases <- data.frame("StartDate" = as.Date(c("2012-01-01", "2012-01-08")),
                 "EndDate" = as.Date(c("2012-01-10", "2012-01-21")))
library(lubridate)

x <- new_interval(Leases$StartDate, Leases$EndDate, tzone = "UTC")
AllDays$NumberInEffect <- sapply(AllDays$Days, function(a){sum(a %within% x)})

The Output

head(AllDays)
        Days NumberInEffect
1 2012-01-01              1
2 2012-01-02              1
3 2012-01-03              1
4 2012-01-04              1
5 2012-01-05              1
6 2012-01-06              1

Upvotes: 0

AntoniosK

Reputation: 16121

An alternative approach, but I'm not sure it's faster.

library(lubridate)
library(dplyr)

AllDays = data.frame(dates = c("2012-02-01","2012-03-02","2012-04-03"))

Lease = data.frame(start = c("2012-01-03","2012-03-01","2012-04-02"),
                   end = c("2012-02-05","2012-04-15","2012-07-11"))

# transform to dates
AllDays$dates = ymd(AllDays$dates)
Lease$start = ymd(Lease$start)
Lease$end = ymd(Lease$end)

# create the range id
Lease$id = 1:nrow(Lease)

AllDays

#        dates
# 1 2012-02-01
# 2 2012-03-02
# 3 2012-04-03

Lease

#       start        end id
# 1 2012-01-03 2012-02-05  1
# 2 2012-03-01 2012-04-15  2
# 3 2012-04-02 2012-07-11  3


data.frame(expand.grid(AllDays$dates,Lease$id)) %>%      # create combinations of dates and ranges
  select(dates=Var1, id=Var2) %>%
  inner_join(Lease, by="id") %>%                         # join information
  rowwise %>%
  do(data.frame(dates=.$dates,
                flag = ifelse(.$dates %in% seq(.$start,.$end,by="1 day"),1,0))) %>%     # create ranges and check if the date is in there
  ungroup %>%
  group_by(dates) %>%
  summarise(N=sum(flag))

#        dates N
# 1 2012-02-01 1
# 2 2012-03-02 1
# 3 2012-04-03 2

Upvotes: 1

R: sequence of days between dates

Answers (5)

Data

Approach

Related Questions