Reputation: 1981
I have a large data set of data representing paired blocks of time, however I want to be able to have a clean break across year boundaries with each row starting and finishing in the same year.
As an example see the table below.
type duration cumsum year year.split
1 1 236 236 1 365
2 0 129 365 1 365
3 1 154 519 2 730
4 0 216 735 3 1095
There is no overlap between years one and two as row 3 starts on the first day of year two, however row 4 starts in year two and ends 5 days into year three. I want to split row 4 so that the table looks like the following.
type duration cumsum year year.split
1 1 236 236 1 365
2 0 129 365 1 365
3 1 0 519 1 365
4 1 154 519 2 730
5 0 211 524 2 730
6 0 5 735 3 1095
As can be seen there is no overlap across years as each overlapping block of time has been split up so each row starts and finishes in the same year. The way I have done this so far is as follows, however it seems clunky and I would hope there is a more elegant solution.
set.seed(808)
test <- data.frame(type = c(1,0), duration = round(runif(20, min = 100, max = 250))) %>%
mutate(cumsum = cumsum(duration), year = ceiling(cumsum/365), year.split = year*365 )
test <- rbind(test[1,],
filter(test, lag(year) == year),
filter(test, lag(year) != year) %>%
mutate( duration = cumsum - (year-1)*365),
filter(test, lag(year) != year) %>%
mutate( duration = ((year-1)*365 + duration- cumsum),
cumsum = cumsum-duration,
year = year -1,
year.split = year*365) ) %>% arrange(year, cumsum)
test <- group_by( test,type, year) %>%
summarise( duration = sum(duration)) %>% ungroup %>% arrange(year)
The final two lines of code summarises the data as I am interested in the total amount of each type per year.
What is a better way of doing this?
Upvotes: 1
Views: 70
Reputation: 7119
Not sure if it's the R
way that you are looking for, but you can simplify a bit your rbind
function:
rbind (filter(test, cumsum - duration >= (year - 1) * 365),
filter(test, cumsum - duration < (year - 1) * 365) %>%
mutate(duration = cumsum - (year - 1) * 365),
filter(test, cumsum - duration < (year - 1) * 365) %>%
mutate(year = year - 1, # I'm changing the year first so it will propagate
duration = duration - (cumsum - (year * 365)),
cumsum = (year) * 365,
year.split = year * 365)
)
As you can see I combine three data.frame:
There are two things I don't like here: I used twice the same filter (for case 2 and 3) and tomorrow I will need 10/15 minutes to understand this code (or I can put a comment like # It works, don't worry
).
I think that a more verbose version of this code will be easier to maintain:
# These don't overlap
ok <- filter(test, cumsum - duration >= (year - 1) * 365)
# These do overlap! We need to split them in two
ko <- filter(test, cumsum - duration < (year - 1) * 365)
# For the most recent year, it's enough to change the duration
ko.recent <- mutate(ko,
duration = cumsum - (year - 1) * 365
)
# For the previous year, a bit more
ko.previous <- mutate(ko,
year = year - 1, # I'm changing the year first
# so it will propagate
duration = duration - (cumsum - (year * 365)),
cumsum = (year) * 365,
year.split = year * 365
)
# Let me put them back together and sort them for you
test1 <- rbind (ok,
ko.recent,
ko.previous
)
Not sure if this was the answer you were looking for, I'm just learning R
.
Upvotes: 0
Reputation: 15163
This seems to work, assuming that the durations are all strictly positive:
cs<-test$cumsum
cs0<-sort(unique(c(cs,(1:floor(max(cs)/365))*365)))
data.frame(type=test$type[findInterval(cs0-0.5,cs)+1],
duration=diff(c(0,cs0)),cumsum=cs0,year=ceiling(cs0/365))
type duration cumsum year
1 1 236 236 1
2 0 129 365 1
3 1 154 519 2
4 0 211 730 2
5 0 5 735 3
Upvotes: 2