Reputation: 19
I have csv file that I am reading into R. The data contains courses that had 3 hour exams in the last 3 years. What I am trying to do is check all current courses and see if they had a 3 hour exam within the last 3yrs
data looks like this:
courseid subject catalog first_year enrollment_more term
124923 AESTHINT 15 NA NA 2188
110744 AESTHINT 24 NA NA 2168
122277 AESTHINT 26 NA NA 2182
123913 AESTHINT 36 NA NA 2162
123913 AESTHINT 36 NA YES 2172
127928 AESTHINT 47 NA NA 2168
127022 AESTHINT 51 NA NA 2178
108455 AESTHINT 55 NA NA 2168
109398 AESTHINT 56 NA NA 2172
109501 AESTHINT 57 NA YES 2178
108950 AESTHINT 58 NA NA 2168
108950 AESTHINT 58 NA NA 2188
108580 AESTHINT 59 NA NA 2168
108580 AESTHINT 59 NA NA 2188
122910 AFRAMER 10 NA YES 2172
122910 AFRAMER 10 NA YES 2182
123435 AFRAMER 197 NA NA 2162
123435 AFRAMER 197 NA NA 2172
159801 AKKAD AB NA NA 2172
114298 ANE 102 NA NA 2178
203480 ANE 145 NA NA 2168
112378 ANTHRO 1010 NA NA 2168
112378 ANTHRO 1010 NA NA 2178
Expected output is something like this
courseid subject catalog first_year enrollment_more term
108950 AESTHINT 58 NA NA 2168
108950 AESTHINT 58 NA NA 2188
108580 AESTHINT 59 NA NA 2168
108580 AESTHINT 59 NA NA 2188
I load the data and filter by term (2188 is current term and the rest is prior terms). I tried match_df but that gives me the same results as current_term
# read the data
df <- read.csv("my_data.csv", header = TRUE, fileEncoding = "UTF-8-BOM")
# filter all courses with term == 2188
current_term <- filter(df, term == 2188)
current_term
# filter all courses that do not have term == 2188
prior_terms <- filter(df, term %in% c(2182, 2178, 2172, 2168, 2162))
prior_terms
#compare current_term and prior_term on courseid
compare_terms <- match_df(current_term, prior_terms, on="courseid")
compare_terms
For example AESTHINT 24 should not be in my comparison result because it did not have an exam in term 2188
Upvotes: 0
Views: 52
Reputation: 116
Working in the plyr library with match_df
is essentially the same as subset of a data frame here. Let's read in a csv of your sample data and setup two data frames.
data <- read.csv("sample.csv")
current_term <- 2188
`%notin%` <- function(x,y) !(x %in% y)
current_classes <- subset(data, data$term %in% current_term)
previous_classes <- subset(data, data$term %notin% current_term)
For your case I am going to use subset
because it requires one less package to be loaded and in this example it is interchangeable with filter
. However, if you are going to do further transformations, use very large data sets, or use this function within a pipe go ahead and load dplyr
use filter
. Also, I have set the current term as an object and created the %notin%
function to make this easier to automate over other terms.
Now to stick with the match_df
function you would need to rewrite it as:
match_df(previous_classes, current_classes, on="courseid")
The first data frame in the match_df
function is the data frame to subset, and the second data frame is the one that defines the rows to match. I would just use filtering or subsetting here to accomplish your goal. Also, in your output if you want a row for each class that fits your criteria in both the current term and any previous term then you will have to bind them back together.
The following code will produce the answer that you are looking for without having to load any packages:
data <- read.csv("sample.csv")
current_term <- 2188
`%notin%` <- function(x,y) !(x %in% y)
current_classes <- subset(data, data$term %in% current_term)
previous_classes <- subset(data, data$term %notin% current_term)
dat1 <- subset(previous_classes, previous_classes$courseid %in%
current_classes$courseid)
dat2 <- subset(current_classes, current_classes$courseid %in%
previous_classes$courseid)
final <- rbind(dat1, dat2)
Just replace the csv name with whatever your file is called.
Upvotes: 1