Splitting a row at year change

Question

I have a large data set representing paired blocks of time. I want a clean break at year boundaries, so that each row starts and finishes in the same year.

As an example, see the table below.

   type duration cumsum year year.split
1     1      236    236    1        365
2     0      129    365    1        365
3     1      154    519    2        730
4     0      216    735    3       1095

There is no overlap between years one and two, as row 3 starts on the first day of year two. However, row 4 starts in year two and ends 5 days into year three. I want to split row 4 so that the table looks like the following.

   type duration cumsum year year.split
1     1      236    236    1        365
2     0      129    365    1        365
3     1        0    519    1        365
4     1      154    519    2        730
5     0      211    524    2        730
6     0        5    735    3       1095

As can be seen, there is no overlap across years: each block of time that crossed a year boundary has been split so that each row starts and finishes in the same year. This is how I have done it so far, but it seems clunky and I hope there is a more elegant solution.

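library(dplyr)

set.seed(808)
# Alternating types with random durations; cumsum is the running end day
test <- data.frame(type = c(1, 0),
                   duration = round(runif(20, min = 100, max = 250))) %>%
  mutate(cumsum = cumsum(duration),
         year = ceiling(cumsum / 365),
         year.split = year * 365)

test <- rbind(
  test[1, ],                        # first row has no predecessor (lag is NA)
  filter(test, lag(year) == year),  # rows ending in the same year as the previous row
  # remaining rows: the part that falls in the later year
  filter(test, lag(year) != year) %>%
    mutate(duration = cumsum - (year - 1) * 365),
  # remaining rows: the part that falls in the earlier year
  filter(test, lag(year) != year) %>%
    mutate(duration = (year - 1) * 365 + duration - cumsum,
           cumsum = cumsum - duration,
           year = year - 1,
           year.split = year * 365)
) %>%
  arrange(year, cumsum)

# Total duration of each type per year
test <- group_by(test, type, year) %>%
  summarise(duration = sum(duration)) %>%
  ungroup() %>%
  arrange(year)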

The final statement summarises the data, as I am interested in the total amount of each type per year.

What is a better way of doing this?

mrip · Accepted Answer

This seems to work, assuming that the durations are all strictly positive:

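cs <- test$cumsum
# End points of the new rows: the original ends plus every year boundary in range
cs0 <- sort(unique(c(cs, seq_len(floor(max(cs) / 365)) * 365)))
# findInterval maps each end point back to the original row that contains it
data.frame(type = test$type[findInterval(cs0 - 0.5, cs) + 1],
           duration = diff(c(0, cs0)), cumsum = cs0, year = ceiling(cs0 / 365))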

  type duration cumsum year
1    1      236    236    1
2    0      129    365    1
3    1      154    519    2
4    0      211    730    2
5    0        5    735    3
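
To get the per-year totals the question asks for, the same idea can be wrapped in a function and followed by a base R aggregation. This is only a minimal sketch: it assumes whole-day, strictly positive durations and a fixed 365-day year, as in the question. The names `split_at_years`, `year_len`, `example`, `pieces` and `totals` are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper: split blocks of time at year boundaries.
# Assumes whole-day, strictly positive durations and a fixed year length.
split_at_years <- function(type, duration, year_len = 365) {
  stopifnot(all(duration > 0))
  cs <- cumsum(duration)
  # Year boundaries that fall inside the covered range (none if under a year)
  bounds <- seq_len(floor(max(cs) / year_len)) * year_len
  # End points of the new rows: original ends plus the year boundaries
  cs0 <- sort(unique(c(cs, bounds)))
  # Index of the original row that each new end point belongs to;
  # the 0.5 offset relies on durations being whole numbers
  idx <- findInterval(cs0 - 0.5, cs) + 1
  data.frame(type = type[idx],
             duration = diff(c(0, cs0)),
             cumsum = cs0,
             year = ceiling(cs0 / year_len))
}

# The four-row example from the question
example <- data.frame(type = c(1, 0, 1, 0),
                      duration = c(236, 129, 154, 216))
pieces <- split_at_years(example$type, example$duration)

# Total duration of each type per year
totals <- aggregate(duration ~ type + year, data = pieces, FUN = sum)
print(totals)
```

Because every new row lies within a single year, summing `duration` by `type` and `year` needs no further boundary handling.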
