asokol

Reputation: 129

Test for set inclusion and processing data simultaneously in tidyverse

I almost have what I need. I need some help with the last detail! The data set is produced by the following:

set.seed(29) # seed before sampling so the data is reproducible

stu_vec <- c("A","B","C","D","E","F","G","H","I","J")
college_vec <- c("ATC","CCTC","DTC","FDTC","GTC","NETC", "USC", "Clemson", "Winthrop", "Allen")
sctcs <- c("ATC","CCTC","DTC","FDTC","GTC","NETC")
Student <- sample(stu_vec, size = 100, replace = TRUE, prob = c(.08,.09,.06,.07,.12,.10,.07,.05,.11,.05))
College <- sample(college_vec, size = 100, replace = TRUE, prob = c(.08,.07,.13,.12,.11,.06,.05,.08,.02,.08))

# data.frame() keeps the columns as they are; cbind() would first coerce them to a matrix
test.dat1 <- data.frame(Student, College)

I am using the following code to create what I need:

library(dplyr)

set.seed(29)
test.dat2 <- test.dat1 %>% 
  group_by(Student, .drop=F) %>% #group by student
  mutate(semester= sequence(n())) %>% #set semester sequence
  summarise(home_school = College[min(which(College %in% sctcs))], # first college in sctcs
            seq_home = min(which(College %in% sctcs)), # position of that college in the sequence
            # new_school should be the first non-sctcs school after the sctcs school
            # is found, or the last school for that student
            new_school = if_else(n_distinct(College) > 1,
                                 first(College[!(College %in% sctcs) & semester > seq_home]),
                                 last(College)))

It produces the following table:

[Image: resulting table, not shown]

I want the NAs to be filled in with the last college for that student, but I don't know how to get rid of them. If you know an easier way to produce the same thing, please share it.

Upvotes: 0

Views: 229

Answers (2)

Captain Hat

Reputation: 3237

This ought to do it:

test.dat2 <- test.dat1 |>
  group_by(Student, .drop = FALSE) |>
  # Additional mutate step for clarity & simplicity
  mutate(
    semester = row_number(),                 # semester counter within each student
    seq_home = which(College %in% sctcs)[1]  # first sctcs position, NA if there is none
  ) |>
  summarise(
    home_school = College[first(seq_home)],
    new_school = College[
      coalesce(
        first(which(!(College %in% sctcs) & semester > seq_home)),
        first(seq_home),
        n()
      )
    ]
  )

We're indexing College with coalesce(), which returns the first non-missing value from its arguments. First, we look for the first non-sctcs college the student attended after home_school. If that is NA (i.e. there is no such college), we fall back to seq_home, which gives home_school, the first sctcs college they attended. If that is also NA (as it would be if they had never attended any sctcs college), we fall back to length(College), which subsets College to give the last college they attended.
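To make the fallback chain concrete, here is a minimal sketch on a single made-up student history. The College values and the helper name first_transfer are assumptions for illustration, not taken from the question's data.

```r
library(dplyr)

# Hypothetical one-student history, in semester order (illustration only)
College <- c("ATC", "DTC", "GTC")
sctcs <- c("ATC", "CCTC", "DTC", "FDTC", "GTC", "NETC")

semester <- seq_along(College)
seq_home <- which(College %in% sctcs)[1]  # position of the first sctcs college

# No non-sctcs college follows seq_home, so which() is empty and first() gives NA
first_transfer <- first(which(!(College %in% sctcs) & semester > seq_home))

# coalesce() skips the NA and takes the next candidate index
idx <- coalesce(first_transfer, seq_home, length(College))
College[idx]
```

Here the first candidate is missing, so the index falls through to seq_home and the result is the student's home school.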

I'm still not 100% clear on whether this does exactly what you want - I don't know if you'd considered the case where there were no sctcs colleges. There are none on this seed, but it could easily have happened.
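For the no-sctcs case, how the first position is computed matters. A small base R sketch, using a made-up student who only attended non-sctcs colleges: min() over an empty which() result gives Inf (with a warning), which is not treated as missing, whereas taking element [1] gives NA.

```r
# Hypothetical student with no sctcs colleges at all (illustration only)
College <- c("USC", "Clemson")
sctcs <- c("ATC", "CCTC", "DTC", "FDTC", "GTC", "NETC")

pos <- which(College %in% sctcs)       # integer(0): no matches

inf_pos <- suppressWarnings(min(pos))  # Inf, and min() warns about the empty input
na_pos <- pos[1]                       # NA

is.na(inf_pos)  # FALSE, so an NA-based fallback would not skip it
is.na(na_pos)   # TRUE
```

So if you rely on NA to trigger a fallback, you may want the position to come out as NA rather than Inf.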

Upvotes: 1

Captain Hat

Reputation: 3237

It's not clear what you're trying to do. But when !(College %in% sctcs) & semester > seq_home is FALSE for every row, College[!(College %in% sctcs) & semester > seq_home] returns a zero-length character vector, so first(College[!(College %in% sctcs) & semester > seq_home]) returns NA.
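A quick way to see this on toy data (the values below are made up, and keep stands in for the all-FALSE condition). The last line is one possible way to fall back to the student's last college, assuming that is what you want when there is no match:

```r
library(dplyr)

# Hypothetical two-semester history (illustration only)
College <- c("ATC", "DTC")
keep <- c(FALSE, FALSE)  # stands in for an all-FALSE condition

empty <- College[keep]        # zero-length character vector
first_match <- first(empty)   # NA, because there is nothing to take

# One possible fallback: use the last college when there is no match
new_school <- coalesce(first_match, last(College))
new_school
```

In this toy case new_school ends up as "DTC", the last entry of College.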

When there are no TRUE values in [!(College %in% sctcs) & semester > seq_home], it's because there are no non-sctcs colleges in any of the semesters after semester[seq_home]. If a student transfers from home_school to one or more sctcs schools, but never to any non-sctcs schools, you'll get an NA value.

You're effectively asking the wrong question. I'm not sure what question you're trying to ask, but what you're currently asking is:

What's the first non-sctcs school this student attended after they attended their first sctcs school?

Some students, however, never attend a non-sctcs school after attending their first sctcs school. For this reason, you get an NA response, which is the correct answer to the question.

Upvotes: 1
