Mike
Mike

Reputation: 1069

R: sequence of days between dates

I have the following dataframes:

AllDays  
2012-01-01  
2012-01-02  
2012-01-03  
...  
2015-08-18  

Leases 
StartDate  EndDate
2012-01-01 2013-01-01  
2012-05-07 2013-05-06  
2013-09-05 2013-12-01   

What I want to do is, for each date in the allDays dataframe, calculate the number of leases that are in effect. e.g. if there are 4 leases with start date <= 2015-01-01 and end date >= 2015-01-01, then I would like to place a 4 in that dataframe.

I have the following code

  for (i in 1:nrow(leases))
  {
    occupied = seq(leases$StartDate[i],leases$EndDate[i],by="days")
    occupied = occupied[occupied < dateOfInt]
    matching = match(occupied,allDays$Date)
    allDays$Occupancy[matching] = allDays$Occupancy[matching] + 1
  }

which works, but as I have about 5000 leases, it takes about 1.1 seconds. Does anyone have a more efficient method that would require less computation time? Date of interest is just the current date and is used simply to ensure that it doesn't count lease dates in the future.

Upvotes: 2

Views: 3307

Answers (5)

Khashaa
Khashaa

Reputation: 7373

This is exactly the problem where foverlaps shines: subsetting a data.frame based upon another data.frame (foverlaps seems to be tailored for that purpose).

Based on @MichaelChirico's data.

setkey(days[, AllDays1:=AllDays,], AllDays, AllDays1)
setkey(leases, StartDate, EndDate)
foverlaps(leases, days)[, .(lease_count=.N), AllDays]
#   user  system elapsed 
#  0.114   0.018   0.136
# @MichaelChirico's approach
#   user  system elapsed 
#  0.909   0.000   0.907 

Here is a brief explanation on how it works by @Arun, which got me started with the data.table.

Upvotes: 5

MichaelChirico
MichaelChirico

Reputation: 34703

Using seq is almost surely inefficient--imagine you had a lease in your data that's 10000 years long. seq will take forever and return 10000*365-1 days that don't matter to us. We then have to use %in% which also makes the same number of unnecessary comparisons.

I'm not sure the following is the best approach (I'm convinced there's a fully vectorized solution) but it gets closer to the heart of the problem.

Data

set.seed(102349)
days<-data.frame(AllDays=seq(as.Date("2012-01-01"),
                             as.Date("2015-08-18"),"day"))

leases<-data.frame(StartDate=sample(days$AllDays,5000L,T))
leases$EndDate<-leases$StartDate+round(rnorm(5000,mean=365,sd=100))

Approach

Use data.table and sapply:

library(data.table)
setDT(leases); setDT(days)

days[,lease_count:=
       sapply(AllDays,function(x)
         leases[StartDate<=x&EndDate>=x,.N])][]
         AllDays lease_count
   1: 2012-01-01           5
   2: 2012-01-02           8
   3: 2012-01-03          11
   4: 2012-01-04          16
   5: 2012-01-05          18
  ---                       
1322: 2015-08-14        1358
1323: 2015-08-15        1358
1324: 2015-08-16        1360
1325: 2015-08-17        1363
1326: 2015-08-18        1359

Upvotes: 6

ulfelder
ulfelder

Reputation: 5335

Without your data, I can't test whether or not this is faster, but it gets the job done with less code:

for (i in 1:nrow(AllDays)) AllDays$tally[i] = sum(AllDays$AllDays[i] >= Leases$Start.Date & AllDays$AllDays[i] <= Leases$End.Date)

I used the following to test it; note that the relevant columns in both data frames are formatted as dates:

AllDays = data.frame(AllDays = seq(from=as.Date("2012-01-01"), to=as.Date("2015-08-18"), by=1))
Leases = data.frame(Start.Date = as.Date(c("2013-01-01", "2012-08-20", "2014-06-01")), End.Date = as.Date(c("2013-12-31", "2014-12-31", "2015-05-31")))

Upvotes: 2

slightlydeviant
slightlydeviant

Reputation: 44

Try the lubridate package. Create an interval for each lease. Then count the lease intervals which each date falls in.

# make some data
AllDays <- data.frame("Days" = seq.Date(as.Date("2012-01-01"), as.Date("2012-02-01"), by = 1))
Leases <- data.frame("StartDate" = as.Date(c("2012-01-01", "2012-01-08")),
                 "EndDate" = as.Date(c("2012-01-10", "2012-01-21")))
library(lubridate)

x <- new_interval(Leases$StartDate, Leases$EndDate, tzone = "UTC")
AllDays$NumberInEffect <- sapply(AllDays$Days, function(a){sum(a %within% x)})

The Output

head(AllDays)
        Days NumberInEffect
1 2012-01-01              1
2 2012-01-02              1
3 2012-01-03              1
4 2012-01-04              1
5 2012-01-05              1
6 2012-01-06              1

Upvotes: 0

AntoniosK
AntoniosK

Reputation: 16121

An alternative approach, but I'm not sure it's faster.

library(lubridate)
library(dplyr)

AllDays = data.frame(dates = c("2012-02-01","2012-03-02","2012-04-03"))

Lease = data.frame(start = c("2012-01-03","2012-03-01","2012-04-02"),
                   end = c("2012-02-05","2012-04-15","2012-07-11"))

# transform to dates
AllDays$dates = ymd(AllDays$dates)
Lease$start = ymd(Lease$start)
Lease$end = ymd(Lease$end)

# create the range id
Lease$id = 1:nrow(Lease)

AllDays

#        dates
# 1 2012-02-01
# 2 2012-03-02
# 3 2012-04-03

Lease

#       start        end id
# 1 2012-01-03 2012-02-05  1
# 2 2012-03-01 2012-04-15  2
# 3 2012-04-02 2012-07-11  3


data.frame(expand.grid(AllDays$dates,Lease$id)) %>%      # create combinations of dates and ranges
  select(dates=Var1, id=Var2) %>%
  inner_join(Lease, by="id") %>%                         # join information
  rowwise %>%
  do(data.frame(dates=.$dates,
                flag = ifelse(.$dates %in% seq(.$start,.$end,by="1 day"),1,0))) %>%     # create ranges and check if the date is in there
  ungroup %>%
  group_by(dates) %>%
  summarise(N=sum(flag))

#        dates N
# 1 2012-02-01 1
# 2 2012-03-02 1
# 3 2012-04-03 2

Upvotes: 1

Related Questions