Test for set inclusion and processing data simultaneously in tidyverse

Question

I almost have what I need. I need some help with the last detail! The data set is produced by the following:

stu_vec <- c("A","B","C","D","E","F","G","H","I","J")
college_vec <- c("ATC","CCTC","DTC","FDTC","GTC","NETC", "USC", "Clemson", "Winthrop", "Allen")
sctcs <- c("ATC","CCTC","DTC","FDTC","GTC","NETC")
Student <- sample(stu_vec, size = 100, replace = TRUE, prob = c(.08, .09, .06, .07, .12, .10, .07, .05, .11, .05))
College <- sample(college_vec, size = 100, replace = TRUE, prob = c(.08, .07, .13, .12, .11, .06, .05, .08, .02, .08))

test.dat1 <- as.data.frame(cbind(Student, College))

I am using the following code to create what I need:

library(dplyr)

set.seed(29)
test.dat2 <- test.dat1 %>%
  group_by(Student, .drop = FALSE) %>% # group by student
  mutate(semester = sequence(n())) %>% # set semester sequence
  summarise(
    home_school = College[min(which(College %in% sctcs))], # find first college in sctcs
    seq_home = min(which(College %in% sctcs)), # add column of sequence values
    # new_school should be the first non-sctcs school after the sctcs school
    # is found, or the last school for that student
    new_school = if_else(n_distinct(College) > 1,
                         first(College[!(College %in% sctcs) & semester > seq_home]),
                         last(College))
  )

It produces a table with one row per student, but some of the values are NA.

I want the NAs to be filled in with the last college for that student, but I don't know how to get rid of them. If you know an easier way to produce the same thing, please share the knowledge.

Captain Hat · Accepted Answer

This ought to do it:

test.dat2 <- test.dat1 |>
  group_by(Student, .drop = FALSE) |>
  # Number the semesters within each student, in row order, so that
  # semester and seq_home are on the same per-student scale
  mutate(semester = row_number()) |>
  summarise(
    # Position of the first sctcs college; first() gives NA if there is none
    # (min() would give Inf, which coalesce() does not treat as missing)
    seq_home = first(which(College %in% sctcs)),
    home_school = College[seq_home],
    new_school =
      College[
        coalesce(
          first(which(!(College %in% sctcs) & semester > seq_home)),
          seq_home,
          n()
        )
      ]
  )

We're indexing College with coalesce(), which returns the first non-missing value from its arguments. First, we look for the first non-sctcs college the student attended after home_school. If there is no such college, that lookup returns NA and we fall back to seq_home, which gives the home school itself (the first sctcs college they attended). If seq_home is also missing, as is intended when the student never attended any sctcs college, we fall back to the last position in College, which gives the last college they attended.
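
To see the fallback chain on its own, here is a minimal sketch that applies it to single vectors rather than to a grouped data frame. The helper pick_new_school() is hypothetical and only for illustration, and it assumes the position of the home school is taken with first(which(...)), so that "no sctcs college" shows up as NA.

```r
library(dplyr)

sctcs <- c("ATC", "CCTC", "DTC", "FDTC", "GTC", "NETC")

# Hypothetical helper (not part of the original answer): returns the college
# that the coalesce() chain would pick for one student's sequence of colleges.
pick_new_school <- function(college) {
  semester <- seq_along(college)
  # Assumption: first() is used so that an empty match gives NA
  seq_home <- first(which(college %in% sctcs))
  idx <- coalesce(
    first(which(!(college %in% sctcs) & semester > seq_home)),  # 1st choice
    seq_home,                                                   # 2nd choice
    length(college)                                             # last resort
  )
  college[idx]
}

pick_new_school(c("ATC", "DTC", "USC", "Clemson"))  # "USC": first non-sctcs after home
pick_new_school(c("USC", "ATC", "DTC"))             # "ATC": falls back to seq_home
pick_new_school(c("USC", "Clemson"))                # "Clemson": falls back to last college
```

Each call exercises one argument of coalesce(), which makes it easy to check that the order of the fallbacks is the one you want.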

I'm still not 100% clear on whether this does exactly what you want: I don't know if you'd considered the case where a student has no sctcs colleges at all. No student is in that situation with this seed, but it could easily have happened.
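
One detail worth checking for that case is how the position of the first sctcs college is computed, because it decides whether coalesce() ever reaches its later arguments. The sketch below uses two made-up colleges for a student with no sctcs college and compares min(which(...)) with first(which(...)). The variable names are illustrative only.

```r
library(dplyr)

sctcs <- c("ATC", "CCTC", "DTC", "FDTC", "GTC", "NETC")
in_sctcs <- c("USC", "Clemson") %in% sctcs  # FALSE FALSE: no sctcs college

# min() of an empty vector is Inf (with a warning), and Inf is not NA,
# so coalesce() keeps it and never reaches the fallback
seq_min <- suppressWarnings(min(which(in_sctcs)))
is.na(seq_min)         # FALSE
coalesce(seq_min, 2)   # Inf

# first() of an empty vector is NA, so coalesce() falls through
seq_first <- first(which(in_sctcs))
is.na(seq_first)        # TRUE
coalesce(seq_first, 2L) # 2
```

Indexing a character vector with Inf returns NA, so an Inf position would leave an NA in the result for such a student instead of their last college.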
