Reputation: 13
I have a large data data set of patient encounters(~6 million). Each patient may have multiple entries each year over multiple years. I would like to be able arrange the patients according to years and then number them so that I can filter out all but one year for each patient so that I can look at each patient for their first year in a particular health plan.
I am able to rank and filter out the first entry for each patient however I thought I would have to create a new df and subset original data frame based on the two columns generated in my new data frame using %in%. This is where I am having trouble.
While I use stack overflow frequently to find solutions to my questions, I do not typically post so bear with me if I am not doing it properly.
enrolid<- c(223801,223801, 223801, 223801, 223801, 223803, 223803, 223804)
year<- c(2008, 2008, 2009, 2010, 2011, 2008, 2011, 2008)
service<- c( "CT", "Colonoscopy", "labs", "office_visit", "med", "office_vist", "hospitalization", "CT")
#But for 6 million enounters. I want to me extract the enrolid and first #year for each individual in my data set.
df1<-data.frame(enrolid, year, service)
df2<- df1 %>%
group_by(enrolid) %>%
filter(rank(year, ties.method="first")==1) %>%
mutate(enrollment_year_num = 1) %>%
select(enrolid, year)`
df1 %>%
filter_all(any_vars(. %in% df2)) #tried with df2$enrolid & df2year
Thnaks!
Upvotes: 0
Views: 43
Reputation: 14774
Could also use slice
:
df1 %>% group_by(enrolid) %>% slice(which.min(year))
Upvotes: 0
Reputation: 1285
You can just do it all in one step with the filter statement (make sure year
is a numeric variable for this to work).
df1 %>%
group_by(enrolid) %>%
filter(year == min(year))
Upvotes: 1