Reputation: 59
I'm trying to summarize a data.frame that contains date (or time) information. Suppose this one contains hospitalization records by patient:
df <- data.frame(c(1, 2, 1, 1, 2, 2),
c(as.Date("2013/10/15"), as.Date("2014/10/15"), as.Date("2015/7/16"), as.Date("2016/1/7"), as.Date("2015/12/20"), as.Date("2015/12/25")))
names(df) <- c("patient.id", "hospitalization.date")
df looks like this:
> df
patient.id hospitalization.date
1 1 2013-10-15
2 2 2014-10-15
3 1 2015-07-16
4 1 2016-01-07
5 2 2015-12-20
6 2 2015-12-25
For each observation, I need to count the number of hospitalizations occurring in the 365 days before that hospitalization (the current one included, as the expected output shows). In my example, it would be the new df$hospitalizations.last.year column:
> df
patient.id hospitalization.date hospitalizations.last.year
1 1 2013-10-15 1
2 2 2014-10-15 1
3 1 2015-07-16 1
4 2 2015-12-20 1
5 2 2015-12-25 2
6 1 2016-01-07 2
7 2 2016-02-10 3
Note that the counter includes all records from the preceding 365 days, not only those in the current calendar year.
I'm trying to do this using dplyr or data.table because my dataset is huge and performance matters. Is it possible?
Upvotes: 0
Views: 297
Reputation: 42544
Since version 1.9.8 (on CRAN 25 Nov 2016), data.table offers non-equi joins:
library(data.table)
# coerce to data.table
setDT(df)[
# create helper column
, date_365 := hospitalization.date - 365][
# step1: non-equi self-join
df, on = c("patient.id", "hospitalization.date>=date_365",
"hospitalization.date<=hospitalization.date")][
# step 2: count hospitalizations.last.year for each patient
, .(hospitalizations.last.year = .N),
by = .(patient.id, hospitalization.date = hospitalization.date.1)]
patient.id hospitalization.date hospitalizations.last.year
1: 1 2013-10-15 1
2: 2 2014-10-15 1
3: 1 2015-07-16 1
4: 2 2015-12-20 1
5: 2 2015-12-25 2
6: 1 2016-01-07 2
7: 2 2016-02-10 3
Edit: Join and aggregation can be combined in one step:
# coerce to data.table
setDT(df)[
# create helper column
, date_365 := hospitalization.date - 365][
# non-equi self-join
df, on = c("patient.id", "hospitalization.date>=date_365",
"hospitalization.date<=hospitalization.date"),
# count hospitalizations.last.year grouped by join parameters
.(hospitalizations.last.year = .N), by = .EACHI][
# remove duplicate column
, hospitalization.date := NULL][]
The result is the same as above.
The OP has provided two data sets, with 6 and 7 rows respectively. Here, the data set with 7 rows is used because it matches the posted expected result:
df <- data.frame(
patient.id = c(1L, 2L, 1L, 1L, 2L, 2L, 2L),
hospitalization.date = as.Date(c("2013/10/15", "2014/10/15", "2015/7/16",
"2016/1/7", "2015/12/20", "2015/12/25", "2016/2/10")))
df <- df[order(df$hospitalization.date), ]
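To sanity-check the counting rule on a small sample, a brute-force version in base R may help. This is only a sketch and not part of the data.table approach: it compares every row with every other row, so it is quadratic and not suitable for a huge dataset. It assumes the same inclusive window as the join conditions (the date minus 365 days up to and including the date itself) and rebuilds the 7-row data so that it runs on its own.

```r
# Brute-force reference check in base R (quadratic; for small samples only).
# Assumption: the window is inclusive on both ends, [date - 365, date],
# mirroring the non-equi join conditions used with data.table.
df <- data.frame(
  patient.id = c(1L, 2L, 1L, 1L, 2L, 2L, 2L),
  hospitalization.date = as.Date(c("2013/10/15", "2014/10/15", "2015/7/16",
                                   "2016/1/7", "2015/12/20", "2015/12/25",
                                   "2016/2/10")))
df <- df[order(df$hospitalization.date), ]

df$hospitalizations.last.year <- vapply(seq_len(nrow(df)), function(i) {
  # rows belonging to the same patient as row i
  same_patient <- df$patient.id == df$patient.id[i]
  # rows dated within the 365 days up to and including row i's date
  in_window <- df$hospitalization.date <= df$hospitalization.date[i] &
    df$hospitalization.date >= df$hospitalization.date[i] - 365
  sum(same_patient & in_window)
}, integer(1))

print(df)
# counts in date order: 1 1 1 1 2 2 3
```

If the counts from this check and from the non-equi join disagree on a sample of the real data, the window boundaries are the first thing to compare.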
Upvotes: 2