Reputation: 473
I have a data set that looks like this (a nonsense example):
id <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)
year <- c(1990, 1991, 1992, 1989, 1990, 1991, 1992, 1993, 1989, 1990, 1992, 1993)
event <- c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1)
df <- cbind(id, year, event)
There are supposed to be continuous observations for all three ids from 1989 until death. However, as you can see, id 1 is left-censored (no information from the start), id 2 is right-censored (no info from start or finish), and id 3 has gaps in observation (info from start and finish, but with gaps in between). In a small table this is easy to see, but with large data sets it becomes more difficult.
Edit: Is there a way of grouping by id and creating a summary table with information on the completeness of the data, something like:
id  left-censored  right-censored  gaps in obs.
1   1              0               0
2   0              1               0
3   0              0               1
Upvotes: 0
Views: 577
Reputation: 472
You can group your data frame by ID (I use dplyr, with the data stored as a tibble) and then create new variables that indicate, for each ID, whether the first year of observation was 1989, whether the person died under observation, and whether the number of rows per ID equals the time span (max_year - min_year + 1). In this case I would argue that ID 2 is not left-censored, as her first year of observation is 1989, which you define as the starting year.
library(tibble)
library(dplyr)
id <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)
year <- c(1990, 1991, 1992, 1989, 1990, 1991, 1992, 1993, 1989, 1990, 1992, 1993)
deceased <- c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1)
df <- tibble(id, year, deceased)
start_year <- 1989
df %>%
  group_by(id) %>%
  mutate(left_censored = min(year) > start_year,       ## left-censored if the first observed year is after start_year (1989)
         right_censored = max(deceased) == 0,          ## right-censored if the person did not die under observation
         has_gaps = n() < max(year) - min(year) + 1)   ## has gaps if there are fewer rows than years in the observed span
The result:
# A tibble: 12 x 6
# Groups: id [3]
      id  year deceased left_censored right_censored has_gaps
   <dbl> <dbl>    <dbl> <lgl>         <lgl>          <lgl>
 1     1  1990        0 TRUE          FALSE          FALSE
 2     1  1991        0 TRUE          FALSE          FALSE
 3     1  1992        1 TRUE          FALSE          FALSE
 4     2  1989        0 FALSE         TRUE           FALSE
 5     2  1990        0 FALSE         TRUE           FALSE
 6     2  1991        0 FALSE         TRUE           FALSE
 7     2  1992        0 FALSE         TRUE           FALSE
 8     2  1993        0 FALSE         TRUE           FALSE
 9     3  1989        0 FALSE         FALSE          TRUE
10     3  1990        0 FALSE         FALSE          TRUE
11     3  1992        0 FALSE         FALSE          TRUE
12     3  1993        1 FALSE         FALSE          TRUE
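To make the has_gaps check concrete, here is the arithmetic for ID 3 spelled out in base R. This is only an illustrative sketch: the variable names (years_id3, n_obs, span, missing_years) are made up for this example, and the last step, which lists the missing years, goes slightly beyond the check itself. It also assumes each year appears at most once per ID.

```r
# Years observed for ID 3 in the example data
years_id3 <- c(1989, 1990, 1992, 1993)

# Number of rows actually observed
n_obs <- length(years_id3)

# Number of years that a gap-free series from first to last year would contain
span <- max(years_id3) - min(years_id3) + 1

# Same comparison as has_gaps: fewer rows than years in the span means a gap
has_gaps <- n_obs < span

# Extra step (not part of the check): which years are missing
missing_years <- setdiff(seq(min(years_id3), max(years_id3)), years_id3)
```

Here n_obs is 4 and span is 5, so has_gaps is TRUE, and the missing year is 1991.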
Edit: If you want an overview you can add:
df %>%
  group_by(id) %>%
  mutate(left_censored = min(year) > start_year,       ## left-censored if the first observed year is after start_year (1989)
         right_censored = max(deceased) == 0,          ## right-censored if the person did not die under observation
         has_gaps = n() < max(year) - min(year) + 1) %>%   ## has gaps if there are fewer rows than years in the observed span
  dplyr::distinct(id, left_censored, right_censored, has_gaps) %>%   ## keep one row per ID
  ungroup() %>%
  summarise(left_censored = sum(left_censored),        ## count IDs per category across the whole data set
            right_censored = sum(right_censored),
            has_gaps = sum(has_gaps))
And get:
# A tibble: 1 x 3
  left_censored right_censored has_gaps
          <int>          <int>    <int>
1             1              1        1
As I mentioned before, ID 2 is not considered left-censored here, as her starting year is 1989.
Edit2: If you take away the ungroup() you get the overview you asked for:
df %>%
  group_by(id) %>%
  mutate(left_censored = min(year) > start_year,       ## left-censored if the first observed year is after start_year (1989)
         right_censored = max(deceased) == 0,          ## right-censored if the person did not die under observation
         has_gaps = n() < max(year) - min(year) + 1) %>%   ## has gaps if there are fewer rows than years in the observed span
  dplyr::distinct(id, left_censored, right_censored, has_gaps) %>%   ## keep one row per ID
  summarise(left_censored = sum(left_censored),        ## still grouped by id, so this gives one row per ID
            right_censored = sum(right_censored),
            has_gaps = sum(has_gaps))
and get:
     id left_censored right_censored has_gaps
  <dbl>         <int>          <int>    <int>
1     1             1              0        0
2     2             0              1        0
3     3             0              0        1
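A possibly shorter route to the same per-ID overview is to skip mutate() and distinct() and compute the flags directly inside summarise(). The following is a sketch under a few assumptions rather than a replacement for the code above: the name overview is made up for this example, as.integer() is used only to get 0/1 values like in the requested table, and n_distinct(year) is swapped in for n() so that a year that happens to be duplicated within an ID would not hide a gap.

```r
library(tibble)
library(dplyr)

id <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)
year <- c(1990, 1991, 1992, 1989, 1990, 1991, 1992, 1993, 1989, 1990, 1992, 1993)
deceased <- c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1)
df <- tibble(id, year, deceased)
start_year <- 1989

overview <- df %>%
  group_by(id) %>%
  summarise(
    # first observed year is later than the study start
    left_censored = as.integer(min(year) > start_year),
    # no death recorded during observation
    right_censored = as.integer(max(deceased) == 0),
    # fewer distinct years than the first-to-last span would contain
    has_gaps = as.integer(n_distinct(year) < max(year) - min(year) + 1)
  )

print(overview)
```

Because summarise() collapses each group to a single row, the result already has one row per ID and no further deduplication is needed.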
Upvotes: 1