Jonno Bourne
Jonno Bourne

Reputation: 1981

Splitting a row at year change

I have a large data set of data representing paired blocks of time, however I want to be able to have a clean break across year boundaries with each row starting and finishing in the same year.

As an example see the table below.

   type duration cumsum year year.split
1     1      236    236    1        365
2     0      129    365    1        365
3     1      154    519    2        730
4     0      216    735    3       1095

There is no overlap between years one and two as row 3 starts on the first day of year two, however row 4 starts in year two and ends 5 days into year three. I want to split row 4 so that the table looks like the following.

   type duration cumsum year year.split
1     1      236    236    1        365
2     0      129    365    1        365
3     1        0    519    1        365
4     1      154    519    2        730
5     0      211    524    2        730
6     0        5    735    3       1095

As can be seen there is no overlap across years as each overlapping block of time has been split up so each row starts and finishes in the same year. The way I have done this so far is as follows, however it seems clunky and I would hope there is a more elegant solution.

set.seed(808)
test <- data.frame(type = c(1,0), duration =  round(runif(20, min = 100, max = 250))) %>%
  mutate(cumsum = cumsum(duration), year = ceiling(cumsum/365), year.split = year*365 )

test <- rbind(test[1,],
      filter(test, lag(year) == year), 
      filter(test, lag(year) != year) %>% 
      mutate( duration = cumsum - (year-1)*365),
      filter(test, lag(year) != year) %>% 
        mutate( duration = ((year-1)*365 + duration- cumsum), 
                cumsum = cumsum-duration, 
                year = year -1, 
                year.split = year*365) ) %>% arrange(year, cumsum)


test <- group_by( test,type, year) %>%
  summarise( duration = sum(duration)) %>% ungroup %>% arrange(year)

The final two lines of code summarises the data as I am interested in the total amount of each type per year.

What is a better way of doing this?

Upvotes: 1

Views: 70

Answers (2)

mucio
mucio

Reputation: 7119

Not sure if it's the R way that you are looking for, but you can simplify a bit your rbind function:

rbind (filter(test, cumsum - duration >= (year - 1) * 365),
       filter(test, cumsum - duration < (year - 1) * 365) %>%
         mutate(duration = cumsum - (year - 1) * 365),
       filter(test, cumsum - duration < (year - 1) * 365) %>%
         mutate(year = year - 1, # I'm changing the year first so it will propagate
                duration = duration - (cumsum - (year * 365)),
                cumsum = (year) * 365,
                year.split = year * 365) 
               )

As you can see I combine three data.frame:

  1. Row which are correct, because the duration doesn't overlap two years
  2. I take the rows overlapping and I set the duration to the number of days in the last year
  3. I take the same rows and I change the values accordingly to the previous year.

There are two things I don't like here: I used twice the same filter (for case 2 and 3) and tomorrow I will need 10/15 minutes to understand this code (or I can put a comment like # It works, don't worry).

I think that a more verbose version of this code will be easier to maintain:

# These don't overlap        
ok <- filter(test, cumsum - duration >= (year - 1) * 365)

# These do overlap! We need to split them in two
ko <- filter(test, cumsum - duration < (year - 1) * 365)

# For the most recent year, it's enough to change the duration
ko.recent <- mutate(ko, 
                    duration = cumsum - (year - 1) * 365
) 

# For the previous year, a bit more
ko.previous <- mutate(ko, 
                      year = year - 1, # I'm changing the year first
                                       # so it will propagate
                      duration = duration - (cumsum - (year * 365)),
                      cumsum = (year) * 365,
                      year.split = year * 365
) 

# Let me put them back together and sort them for you
test1 <- rbind (ok,
               ko.recent,
               ko.previous
              ) 

Not sure if this was the answer you were looking for, I'm just learning R.

Upvotes: 0

mrip
mrip

Reputation: 15163

This seems to work, assuming that the durations are all strictly positive:

cs<-test$cumsum
cs0<-sort(unique(c(cs,(1:floor(max(cs)/365))*365)))
data.frame(type=test$type[findInterval(cs0-0.5,cs)+1],
           duration=diff(c(0,cs0)),cumsum=cs0,year=ceiling(cs0/365))

  type duration cumsum year
1    1      236    236    1
2    0      129    365    1
3    1      154    519    2
4    0      211    730    2
5    0        5    735    3

Upvotes: 2

Related Questions