Reputation: 518

For loop generating months between dates in R

I have a data frame with three columns: employee ID, start date and end date (both in year-day-month order, parsed with ydm). My objective is to create a second data frame with two columns, employee ID and date. It takes the IDs from the first data frame, and its date column holds every month between the start date and the end date of that employee. In other words, I want to expand the first data frame by month, according to each employee's start and end dates.

I wrote code that does this with a for loop. The problem is that it is very slow, and I have read that loops should be avoided in R. Is there a way to do the same thing much faster?

An example of my data frame and code is below:

    library(lubridate)

    # Creating Data frame
    a <- data.frame(employeeid = c('a','b','c'), StartDate = c('2018-1-1','2018-1-5','2018-11-2'),
                    EndDate = c('2018-1-3','2018-1-9','2018-1-8'), stringsAsFactors = FALSE)
    a$StartDate <- ydm(a$StartDate)
    a$EndDate <- ydm(a$EndDate)

    #second empty data frame
    a1 <-a
    a1 <- a1[0,1:2]

    #my code starts
    r <- 1
    r.1 <- 1
    for (id in a$employeeid) {

      #r.1 <- 1
      for ( i  in format(seq(a[r,2],a[r,3],by="month"), "%Y-%m-%d") ) { 
        a1[r.1,1] <- a[r,1]
        a1[r.1,2] <- i
        r.1 <- r.1 +1  
      } 
      r <- r+1
    }

This gives one row per employee for each month between the start and end dates. I want the same result, but faster.

Upvotes: 0

Answers (4)

Uwe

Reputation: 42544

For the sake of completeness, here is a concise one-liner with data.table:

library(data.table)
setDT(a)[, .(StartDate = seq(StartDate, EndDate, by = "month")), by = employeeid]

    employeeid  StartDate
 1:          a 2018-01-01
 2:          a 2018-02-01
 3:          a 2018-03-01
 4:          b 2018-05-01
 5:          b 2018-06-01
 6:          b 2018-07-01
 7:          b 2018-08-01
 8:          b 2018-09-01
 9:          c 2018-02-11
10:          c 2018-03-11
11:          c 2018-04-11
12:          c 2018-05-11
13:          c 2018-06-11
14:          c 2018-07-11

Upvotes: 0

Roman

Reputation: 4989

Almost a one-liner with tidyverse:

> result
# A tibble: 12 x 2
   employeeid date      
   <chr>      <date>    
 1 a          2018-01-01
 2 a          2018-02-01
 3 a          2018-03-01
 4 b          2018-05-01
 5 b          2018-06-01
 6 b          2018-07-01
 7 b          2018-08-01
 8 b          2018-09-01
 9 c          2018-11-01
10 c          2018-12-01
11 c          2019-01-01
12 c          2019-02-01

Code

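result <- df %>%
    group_by(employeeid) %>%
    # one list element per employee holding that employee's monthly sequence
    summarise(date = list(seq(StartDate,
                              EndDate,
                              by = "month"))) %>%
    # name the list column explicitly; a bare unnest() warns in recent tidyr
    unnest(date)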

Data

library(tidyverse)
library(lubridate)
df <- data.frame(employeeid = c('a', 'b', 'c'), 
                 StartDate = ymd(c('2018-1-1', '2018-5-1', '2018-11-1')),
                 EndDate = ymd(c('2018-3-1', '2018-9-1', '2019-02-1')),
                 stringsAsFactors = FALSE)

Upvotes: 2

Etienne Kintzler

Reputation: 682

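You can use a combination of apply and do.call. The function parses the dates itself with the format "%Y-%d-%m", so run it on a as originally created, with StartDate and EndDate still character strings (before the ydm() conversion):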

out_apply_list <- apply(X=a, MARGIN=1,
                    FUN=function(x) {
                      data.frame(id= x[1], 
                                 date=seq(from = as.Date(x[2], "%Y-%d-%m"), 
                                          to = as.Date(x[3], "%Y-%d-%m"), 
                                          by = "month"),
                                 row.names = NULL) 
})

df <- do.call(what = rbind, args = out_apply_list)

which gives you the following output:

> df
   id       date
1   a 2018-01-01
2   a 2018-02-01
3   a 2018-03-01
4   b 2018-05-01
5   b 2018-06-01
6   b 2018-07-01
7   b 2018-08-01
8   b 2018-09-01
9   c 2018-02-11
10  c 2018-03-11
11  c 2018-04-11
12  c 2018-05-11
13  c 2018-06-11
14  c 2018-07-11
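
If the columns have already been converted to Date, a similar approach with Map might be easier, because Map hands each element over with its class intact, whereas apply first coerces the whole data frame to a character matrix. This is only a sketch: the example data below is my own restatement of the question's dates after ydm(), and the name pieces is made up for illustration.

```r
# Example data with the columns already converted to Date
# (assumed to match the question's data after ydm()).
a <- data.frame(
  employeeid = c("a", "b", "c"),
  StartDate  = as.Date(c("2018-01-01", "2018-05-01", "2018-02-11")),
  EndDate    = as.Date(c("2018-03-01", "2018-09-01", "2018-08-01")),
  stringsAsFactors = FALSE
)

# Map() walks the three columns in parallel and keeps the Date class,
# so no date parsing is needed inside the function.
pieces <- Map(
  function(id, from, to) {
    data.frame(id = id,
               date = seq(from, to, by = "month"),
               stringsAsFactors = FALSE)
  },
  a$employeeid, a$StartDate, a$EndDate
)

# Bind the per-employee pieces into one data frame.
# unname() keeps the list names out of the row names.
out <- do.call(rbind, unname(pieces))
```

The result should have the same 14 rows as above, with date as a proper Date column.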

Upvotes: 1

TinglTanglBob

Reputation: 647

I'd try to solve this by using apply with a custom function that calculates the difference between the end and start dates.

I'm not sure what your desired output looks like, but the function in the following example pastes all months between start and end into a single string.

library(lubridate)

# Creating Data frame
a<- data.frame(employeeid =c('a','b','c'), StartDate= c('2018-1-1','2018-1-5','2018-11-2'),
               EndDate= c('2018-2-3','2019-1-9','2020-1-8'), stringsAsFactors = F)
a$StartDate <- ymd(a$StartDate)
a$EndDate <- ymd(a$EndDate)

# create a mapping from month number to month name
month_names = month.abb[1:12]


month_dif = function(dates) # function to compute the difference; it expects a vector of 2 elements
{
  start = dates[1] # the first element of the vector is expected to be the start date
  end = dates[2] # the second element is expected to be the end date

  start_month = month(start)
  end_month = month(end) 
  start_year = year(start) 
  end_year = year(end)
  year_dif = end_year - start_year

  if(year_dif == 0){ # if start and end are both in the same year, the months run from start to end
    return(paste(month_names[start_month:end_month], collapse= ", " ))
  } else { # if the range spans several years, the months run from start to December and from January to end (with year_dif - 1 full years in between)
          paste(c(month_names[start_month:12],
          rep(month_names, year_dif-1),
          month_names[1:end_month]), collapse = ", ")
  }
}

apply(a[2:3], 1, month_dif)

Output:

> apply(a[2:3], 1, month_dif)
[1] "Jan, Feb"                                                                 
[2] "Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, Jan"          
[3] "Nov, Dec, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, Jan"
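
The same strings could probably also be built from seq() instead of the year arithmetic. The following is a rough sketch, not a tested replacement: month_labels is a hypothetical helper name, and snapping both dates to the first of the month is my own assumption so that the day of the month cannot drop the last month.

```r
# Hypothetical helper: month abbreviations covered by [start, end].
month_labels <- function(start, end) {
  # Snap both dates to the first of their month so the day of the
  # month cannot cause the final month to be skipped.
  first <- as.Date(format(start, "%Y-%m-01"))
  last  <- as.Date(format(end, "%Y-%m-01"))
  months <- seq(first, last, by = "month")
  # month.abb is locale-independent, unlike format(x, "%b").
  paste(month.abb[as.POSIXlt(months)$mon + 1], collapse = ", ")
}

# Same dates as in the example above.
start <- as.Date(c("2018-01-01", "2018-01-05", "2018-11-02"))
end   <- as.Date(c("2018-02-03", "2019-01-09", "2020-01-08"))

# mapply() passes the Date elements one pair at a time.
labels <- mapply(month_labels, start, end)
```

For these three rows, labels should match the strings shown above.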

Upvotes: 1
