Reputation: 2101
I have a dataframe containing - for each monthly cohort of patients - the proportion of patients retained in the study after n months.
df <- tibble::tribble(
~cohort, ~months_elapsed, ~act_patients,
"2020-06", 1, 100,
"2020-05", 1, 100,
"2020-05", 2, 80,
"2020-05", 3, 20,
"2020-04", 1, 100,
"2020-04", 2, 80,
"2020-04", 3, 50,
"2020-04", 4, 10
)
# A tibble: 8 x 3
cohort months_elapsed act_patients
<chr> <dbl> <dbl>
1 2020-06 1 100
2 2020-05 1 100
3 2020-05 2 80
4 2020-05 3 20
5 2020-04 1 100
6 2020-04 2 80
7 2020-04 3 50
8 2020-04 4 10
Unfortunately, the dataframe contains inconsistent rows I want to clean out as follows:
Let's imagine we are at the end of the month of June 2020.
The cohort 2020-06, after the current month has elapsed, has the totality (100%) of the patients active in the study.
The cohort of 2020-05, had 100% of the patients after a month elapsed, 80% of the patients after the second month elapsed. But here I have an extra row, containing the % of active patients of a month that didn't elapse yet (the 3rd month). Being at the end of June 2020, only 2 months have fully elapsed (May and June)
Same for the cohort of 2020-04, I have an extra row (the 4th month) I want to clean out: the cohort of April 2020, at the end of June, had just 3 possible months that elapsed (April, May, June).
The final - clean - dataframe I want to have is:
df_clean <- tibble::tribble(
~cohort, ~months_elapsed, ~act_patients,
"2020-06", 1, 100,
"2020-05", 1, 100,
"2020-05", 2, 80,
"2020-04", 1, 100,
"2020-04", 2, 80,
"2020-04", 3, 50
)
# A tibble: 6 x 3
cohort months_elapsed act_patients
<chr> <dbl> <dbl>
1 2020-06 1 100
2 2020-05 1 100
3 2020-05 2 80
4 2020-04 1 100
5 2020-04 2 80
6 2020-04 3 50
I am looking for a possible solution in dplyr, any help is highly appreciated!!
Upvotes: 2
Views: 342
Reputation: 30474
I believe this might be what you're looking for. First create a date column from your year-month cohort
. Then, you can filter
and select rows where the current date is past the cohort_date
plus the number of months that have elapsed.
library(tidyverse)
library(lubridate)
df$cohort_date <- ymd(paste0(df$cohort, "-01"))
filter(df, Sys.Date() > cohort_date + (months(months_elapsed - 1)))
Output
# A tibble: 6 x 4
cohort months_elapsed act_patients cohort_date
<chr> <dbl> <dbl> <date>
1 2020-06 1 100 2020-06-01
2 2020-05 1 100 2020-05-01
3 2020-05 2 80 2020-05-01
4 2020-04 1 100 2020-04-01
5 2020-04 2 80 2020-04-01
6 2020-04 3 50 2020-04-01
Upvotes: 1
Reputation: 27742
Here is my tidy solution, using %m+%
from the lubridate
-package.
library( tidyverse )
library( lubridate )
filter_date = as.Date( "2020-07-01" )
df %>%
filter( as.Date( paste0( cohort, "-01") ) %m+% months( months_elapsed ) <= filter_date )
# # A tibble: 6 x 3
# cohort months_elapsed act_patients
# <chr> <dbl> <dbl>
# 1 2020-06 1 100
# 2 2020-05 1 100
# 3 2020-05 2 80
# 4 2020-04 1 100
# 5 2020-04 2 80
# 6 2020-04 3 50
Upvotes: 1