Reputation: 43
I'm trying to calculate a daily/weekly prevalence rate for a condition, but the sample size in the denominator varies over time. I have a dataset that includes the date on which each subject entered and left the sample (e.g. birth/death dates), the date on which each subject contracted the condition if applicable, and some demographic characteristics.
Example data:
ex <- data.frame(
id=seq(1:10),
birth=as.Date(c("12/01/2020", "12/01/2020", "12/01/2020", "12/02/2020", "12/02/2020",
"12/02/2020", "12/03/2020", "12/04/2020", "12/04/2020", "12/04/2020")),
sick=as.Date(c("12/03/2020", "12/04/2020", "12/02/2020", "12/03/2020", "12/06/2020",
NA, "12/06/2020", "12/07/2020", "12/09/2020", NA)),
death=as.Date(c("12/05/2020", "12/05/2020", "12/04/2020", "12/08/2020", "12/07/2020",
NA, "12/07/2020", "12/09/2020", "12/10/2020", NA)),
gender=c("male", "male", "female", "female", "female", "male", "female", "male", "male", "male")
)
Desired output:
sick <- data.frame(
date=c("12/01/2020", "12/02/2020", "12/03/2020", "12/04/2020", "12/05/2020",
"12/06/2020", "12/07/2020", "12/08/2020", "12/09/2020", "12/10/2020"),
count_alive=c(3, 6, 7, 9, 7, 7, 5, 4, 3, 2),
count_sick=c(0, 1, 3, 4, 1, 4, 2, 1, 1, 0)
)
sick$pct_sick <- sick$count_sick/sick$count_alive*100
Upvotes: 0
Views: 114
Reputation: 256
Using the tidyverse my strategy was to create a dummy logical variable for whether the patient was sick and whether they were alive on a particular date. I then created a function that summarised the number of sick and alive patients on that date. Finally I mapped the function over all the dates of interest.
library(tidyverse)
ex <- data.frame(
id=seq(1:10),
birth=as.Date(c("12/01/2020", "12/01/2020", "12/01/2020", "12/02/2020", "12/02/2020",
"12/02/2020", "12/03/2020", "12/04/2020", "12/04/2020", "12/04/2020"), "%d/%m/%Y"),
sick=as.Date(c("12/03/2020", "12/04/2020", "12/02/2020", "12/03/2020", "12/06/2020",
NA, "12/06/2020", "12/07/2020", "12/09/2020", NA), "%d/%m/%Y"),
death=as.Date(c("12/05/2020", "12/05/2020", "12/04/2020", "12/08/2020", "12/07/2020",
NA, "12/07/2020", "12/09/2020", "12/10/2020", NA), "%d/%m/%Y"),
gender=c("male", "male", "female", "female", "female", "male", "female", "male", "male", "male")
)
d <- tibble(date = as.Date(c("12/01/2020", "12/02/2020", "12/03/2020", "12/04/2020" ,
"12/05/2020", "12/06/2020", "12/07/2020", "12/08/2020",
"12/09/2020", "12/10/2020"), "%d/%m/%Y") )
smry <- function(d) {
ex %>% mutate(alive_now = (death >= d | is.na(death) ),
sick_now = (sick <= d & !is.na(sick))) %>%
summarise(count_alive = sum(alive_now),
count_sick = sum(sick_now))
}
d %>% mutate(s = map(date, smry)) %>% unnest(s)
This gave:
# A tibble: 10 x 3
date count_alive count_sick
<date> <int> <int>
1 2020-01-12 10 0
2 2020-02-12 10 1
3 2020-03-12 10 3
4 2020-04-12 10 4
5 2020-05-12 9 4
6 2020-06-12 7 6
7 2020-07-12 7 7
8 2020-08-12 5 7
9 2020-09-12 4 8
10 2020-10-12 3 8
This is not a full answer but it should start you off.
An alternative strategy without the map is to have one copy of the data associated with each date. You can then group_by the date and summarise.
d <- d %>% mutate(dd = list(ex)) %>% unnest(dd) %>%
mutate(alive_now = (death >= date| is.na(death) ),
sick_now = (sick <= date & !is.na(sick)))
d %>% group_by(date) %>%
summarise(count_alive = sum(alive_now),
count_sick = sum(sick_now))
Upvotes: 1