Reputation: 850
Please consider the following:
In a data.frame
similar to the example below, the dates of when a medication was given are noted per patient. The goal is to compute the "time to next treatment" which is defined as the days between the start of one treatment until the start of the next treatment. All other columns in the data.frame
(not shown here) hold distinct information that need to be kept.
My approach is as follows:
library("dplyr")
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
db <- data.frame(id = c(rep("a", 5), rep("b", 3)),
date = c(rep(as.Date("2018-01-01"), 3),
rep(as.Date("2018-01-20"), 2),
rep(as.Date("2018-01-01"), 3)))
db
#> id date
#> 1 a 2018-01-01
#> 2 a 2018-01-01
#> 3 a 2018-01-01
#> 4 a 2018-01-20
#> 5 a 2018-01-20
#> 6 b 2018-01-01
#> 7 b 2018-01-01
#> 8 b 2018-01-01
db %>%
group_by(id) %>%
mutate(time.to.next = as.numeric(lead(date) - date))
#> Warning: package 'bindrcpp' was built under R version 3.4.4
#> # A tibble: 8 x 3
#> # Groups: id [2]
#> id date time.to.next
#> <fct> <date> <dbl>
#> 1 a 2018-01-01 0.
#> 2 a 2018-01-01 0.
#> 3 a 2018-01-01 19.
#> 4 a 2018-01-20 0.
#> 5 a 2018-01-20 NA
#> 6 b 2018-01-01 0.
#> 7 b 2018-01-01 0.
#> 8 b 2018-01-01 NA
Created on 2018-08-13 by the reprex package (v0.2.0).
However, what I need is a data.frame
(or tibble
) looking like this:
#> # A tibble: 8 x 3
#> # Groups: id [2]
#> id date time.to.next
#> <fct> <date> <dbl>
#> 1 a 2018-01-01 19.
#> 2 a 2018-01-01 19.
#> 3 a 2018-01-01 19.
#> 4 a 2018-01-20 NA
#> 5 a 2018-01-20 NA
#> 6 b 2018-01-01 NA
#> 7 b 2018-01-01 NA
#> 8 b 2018-01-01 NA
Question: how can I achieve that all values per group are the same although the only computed values is the difference between the last observation of one group and the first observation of the subsequent group?
Thank you very much.
Upvotes: 0
Views: 151
Reputation: 16121
An alternative option would be to calculate the distance between each date
and the max(date)
of that id and then replace zeros with NA
db <- data.frame(id = c(rep("a", 5), rep("b", 3)),
date = c(rep(as.Date("2018-01-01"), 3),
rep(as.Date("2018-01-20"), 2),
rep(as.Date("2018-01-01"), 3)))
library(dplyr)
db %>%
group_by(id) %>%
mutate(time.to.next = as.numeric(max(date) - date),
time.to.next = ifelse(time.to.next > 0, time.to.next, NA)) %>%
ungroup()
# # A tibble: 8 x 3
# id date time.to.next
# <fct> <date> <dbl>
# 1 a 2018-01-01 19
# 2 a 2018-01-01 19
# 3 a 2018-01-01 19
# 4 a 2018-01-20 NA
# 5 a 2018-01-20 NA
# 6 b 2018-01-01 NA
# 7 b 2018-01-01 NA
# 8 b 2018-01-01 NA
Upvotes: 1
Reputation: 215047
One option is to firstly drop all duplicates in id and date, calculate the time difference and then join back with db
on id and date columns:
db %>%
select(id, date) %>%
distinct() %>%
group_by(id) %>%
mutate(time.to.next = as.numeric(lead(date) - date)) %>%
inner_join(db)
#Joining, by = c("id", "date")
# A tibble: 8 x 3
# Groups: id [?]
# id date time.to.next
# <fct> <date> <dbl>
#1 a 2018-01-01 19
#2 a 2018-01-01 19
#3 a 2018-01-01 19
#4 a 2018-01-20 NA
#5 a 2018-01-20 NA
#6 b 2018-01-01 NA
#7 b 2018-01-01 NA
#8 b 2018-01-01 NA
Upvotes: 3