Reputation: 129
I almost have what I need. I need some help with the last detail! The data set is produced by the following:
stu_vec <- c("A","B","C","D","E","F","G","H","I","J")
college_vec <- c("ATC","CCTC","DTC","FDTC","GTC","NETC", "USC", "Clemson", "Winthrop", "Allen")
sctcs <- c("ATC","CCTC","DTC","FDTC","GTC","NETC")
Student <- sample (stu_vec, size=100,replace=T, prob=c(.08,0.09,0.06,.07,.12,.10,.07,.05,.11,.05))
College <- sample(college_vec, size=100, replace=T,prob=c(.08,.07,.13,.12,.11,.06,.05,.08,.02,.08))
test.dat1 <- as.data.frame(cbind(Student, College))
I am using the following code to create what I need:
library(dplyr)
set.seed(29)
test.dat2 <- test.dat1 %>%
  group_by(Student, .drop = FALSE) %>%  # group by student
  mutate(semester = sequence(n())) %>%  # set semester sequence
  summarise(
    home_school = College[min(which(College %in% sctcs))],  # find first college in sctcs
    seq_home = min(which(College %in% sctcs)),  # add column of sequence values
    # new_school should be the first non-sctcs school after the sctcs school
    # is found, or the last school for that student
    new_school = if_else(n_distinct(College) > 1,
                         first(College[!(College %in% sctcs) & semester > seq_home]),
                         last(College))
  )
It produces the following table
I want the NAs to be filled in with the last college for that student. I don't know how to get rid of the NAs. If you know an easier way to produce the same thing, please share the knowledge.
Upvotes: 0
Views: 229
Reputation: 3237
This ought to do it:
test.dat2 <- test.dat1 |>
  group_by(Student, .drop = FALSE) |>
  # Number semesters within each student, so they are comparable to seq_home
  mutate(semester = row_number(),
         # Additional mutate step for clarity & simplicity;
         # first() gives NA (rather than Inf) if there is no sctcs college
         seq_home = first(which(College %in% sctcs))) |>
  # seq_home is constant within a student, so take a single value
  summarise(home_school = College[seq_home[1]],
            new_school =
              College[
                coalesce(
                  first(which(!(College %in% sctcs) & semester > seq_home)),
                  seq_home[1],
                  length(College))
              ]
  )
We're indexing College with coalesce(), which returns the first non-missing value from its arguments. First, we look for the first non-sctcs college they attended after attending home_school. If that returns NA (i.e. there is no such college), we fall back to seq_home, which gives their home school (the first sctcs college they attended). If that is also NA (as would be the case if they had never attended any sctcs colleges), we fall back to length(College), which of course subsets College to give us the last college they attended.
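To make the fallback chain concrete, here is a minimal sketch on a single made-up student. The vector `college` and the names `first_out`, `idx` and `picked` are hypothetical and only serve the illustration; `seq_along(college)` stands in for the semester column.

```r
library(dplyr)

sctcs   <- c("ATC", "CCTC", "DTC", "FDTC", "GTC", "NETC")
college <- c("ATC", "DTC", "GTC")  # hypothetical student, in semester order

# Position of the first sctcs college (the home school)
seq_home <- first(which(college %in% sctcs))

# Position of the first non-sctcs college after the home school;
# there is none here, so first() on the empty result gives NA
first_out <- first(which(!(college %in% sctcs) &
                           seq_along(college) > seq_home))

# coalesce() skips the NA and takes the next non-missing index
idx    <- coalesce(first_out, seq_home, length(college))
picked <- college[idx]
```

Here `first_out` is missing, so the index falls through to `seq_home` and `picked` is the home school.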
I'm still not 100% clear on whether this does exactly what you want. I don't know if you'd considered the case where a student never attended any sctcs college. No student is in that situation on this seed, but it could easily have happened.
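A small base R sketch of how that no-sctcs case behaves, using a made-up student (`college`, `hits`, `via_min` and `via_index` are hypothetical names). It suggests that `min()` over an empty `which()` result gives `Inf` with a warning, whereas taking the first element gives `NA`, which is the value `coalesce()` skips over.

```r
sctcs   <- c("ATC", "CCTC", "DTC", "FDTC", "GTC", "NETC")
college <- c("USC", "Clemson")  # hypothetical student with no sctcs college

hits <- which(college %in% sctcs)         # integer(0): no match at all

via_min   <- suppressWarnings(min(hits))  # Inf (normally with a warning)
via_index <- hits[1]                      # NA_integer_

home_school <- college[via_index]         # NA: there is no home school
```

Which of the two behaviours is wanted depends on how students without a home school should be reported, so this is worth checking against real data.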
Upvotes: 1
Reputation: 3237
It's not clear what you're trying to do. But when !(College %in% sctcs) & semester > seq_home is FALSE for every row of a student, College[!(College %in% sctcs) & semester > seq_home] returns a zero-length character vector, so first(College[!(College %in% sctcs) & semester > seq_home]) returns NA.
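A minimal sketch of that mechanism for one made-up student who only ever attended sctcs schools (the values of `college` and `seq_home` are assumptions for the illustration):

```r
library(dplyr)

sctcs    <- c("ATC", "CCTC", "DTC", "FDTC", "GTC", "NETC")
college  <- c("ATC", "DTC")      # hypothetical student, sctcs schools only
semester <- seq_along(college)
seq_home <- 1                    # the home school is the first row

# Logical filter used in the question; no row satisfies it
keep <- !(college %in% sctcs) & semester > seq_home

subset_result <- college[keep]         # zero-length character vector
first_result  <- first(subset_result)  # NA, because there is nothing to pick
```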
When there are no TRUE values in !(College %in% sctcs) & semester > seq_home, it's because there are no non-sctcs colleges in any of the semesters after semester[seq_home]. If a student transfers from home_school to one or more sctcs schools, but never to any non-sctcs schools, you'll get an NA value.
You're effectively asking the wrong question. I'm not sure what question you're trying to ask, but what you're currently asking is:
What's the first non-sctcs school this student attended after they attended their first sctcs school?
Some students, however, never attend a non-sctcs school after attending their first sctcs school. For this reason, you get an NA response, which is the correct answer to the question.
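If the intended question is instead "the first non-sctcs school after the first sctcs school, or else the last school the student attended", one possible sketch is to wrap the lookup in coalesce() with last(College) as the fallback. The data frame `toy` below is made up for illustration, and this is only one way to phrase it, not necessarily what the asker needs.

```r
library(dplyr)

sctcs <- c("ATC", "CCTC", "DTC", "FDTC", "GTC", "NETC")

# Hypothetical toy data, rows in semester order within each student
toy <- data.frame(
  Student = c("A", "A", "A", "B", "B"),
  College = c("ATC", "USC", "DTC", "GTC", "DTC"),
  stringsAsFactors = FALSE
)

res <- toy |>
  group_by(Student) |>
  mutate(semester = row_number()) |>   # semester number within each student
  summarise(
    seq_home    = which(College %in% sctcs)[1],  # NA if no sctcs college
    home_school = College[seq_home],
    new_school  = coalesce(
      # first non-sctcs school after the home school, NA if there is none
      first(College[!(College %in% sctcs) & semester > seq_home]),
      # fallback: the last school the student attended
      last(College)
    )
  )
```

Student A moves to a non-sctcs school after the home school, so that school is reported; student B never does, so the fallback to the last school applies.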
Upvotes: 1