Alpha
Alpha

Reputation: 19

Returning rows in dataframe by comparing on a column

I have csv file that I am reading into R. The data contains courses that had 3 hour exams in the last 3 years. What I am trying to do is check all current courses and see if they had a 3 hour exam within the last 3yrs

data looks like this:

courseid    subject catalog first_year  enrollment_more         term
124923      AESTHINT    15      NA          NA                  2188
110744      AESTHINT    24      NA          NA                  2168
122277      AESTHINT    26      NA          NA                  2182
123913      AESTHINT    36      NA          NA                  2162
123913      AESTHINT    36      NA         YES                  2172
127928      AESTHINT    47      NA          NA                  2168
127022      AESTHINT    51      NA          NA                  2178
108455      AESTHINT    55      NA          NA                  2168
109398      AESTHINT    56      NA          NA                  2172
109501      AESTHINT    57      NA          YES                 2178
108950      AESTHINT    58      NA          NA                  2168
108950      AESTHINT    58      NA          NA                  2188
108580      AESTHINT    59      NA          NA                  2168
108580      AESTHINT    59      NA          NA                  2188
122910      AFRAMER     10      NA          YES                 2172
122910      AFRAMER     10      NA          YES                 2182
123435      AFRAMER     197     NA          NA                  2162
123435      AFRAMER     197     NA          NA                  2172
159801      AKKAD       AB      NA          NA                  2172
114298      ANE         102     NA          NA                  2178
203480      ANE         145     NA          NA                  2168
112378      ANTHRO      1010    NA          NA                  2168
112378      ANTHRO      1010    NA          NA                  2178

Expected output is something like this

courseid    subject catalog first_year  enrollment_more         term
108950      AESTHINT    58      NA          NA                  2168
108950      AESTHINT    58      NA          NA                  2188
108580      AESTHINT    59      NA          NA                  2168
108580      AESTHINT    59      NA          NA                  2188

I load the data and filter by term (2188 is current term and the rest is prior terms). I tried match_df but that gives me the same results as current_term

# read the data 
df <- read.csv("my_data.csv", header = TRUE, fileEncoding = "UTF-8-BOM")


# filter all courses with term == 2188
current_term <- filter(df, term == 2188)
current_term

# filter all courses that do not have term == 2188
prior_terms <- filter(df, term %in% c(2182, 2178, 2172, 2168, 2162))
prior_terms

#compare current_term and prior_term on courseid
compare_terms <- match_df(current_term, prior_terms, on="courseid")
compare_terms

For example AESTHINT 24 should not be in my comparison result because it did not have an exam in term 2188

Upvotes: 0

Views: 52

Answers (1)

Darren Frye
Darren Frye

Reputation: 116

Working in the plyr library with match_df is essentially the same as subset of a data frame here. Let's read in a csv of your sample data and setup two data frames.

data <- read.csv("sample.csv")
current_term <- 2188

`%notin%` <- function(x,y) !(x %in% y)

current_classes <- subset(data, data$term %in% current_term)

previous_classes <- subset(data, data$term %notin% current_term)

For your case I am going to use subset because it requires one less package to be loaded and in this example it is interchangeable with filter. However, if you are going to do further transformations, use very large data sets, or use this function within a pipe go ahead and load dplyr use filter. Also, I have set the current term as an object and created the %notin% function to make this easier to automate over other terms.

Now to stick with the match_df function you would need to rewrite it as:

match_df(previous_classes, current_classes, on="courseid")

The first data frame in the match_df function is the data frame to subset, and the second data frame is the one that defines the rows to match. I would just use filtering or subsetting here to accomplish your goal. Also, in your output if you want a row for each class that fits your criteria in both the current term and any previous term then you will have to bind them back together.

The following code will produce the answer that you are looking for without having to load any packages:

data <- read.csv("sample.csv")
current_term <- 2188

`%notin%` <- function(x,y) !(x %in% y)

current_classes <- subset(data, data$term %in% current_term)

previous_classes <- subset(data, data$term %notin% current_term)

dat1 <- subset(previous_classes, previous_classes$courseid %in% 
               current_classes$courseid)

dat2 <- subset(current_classes, current_classes$courseid %in% 
               previous_classes$courseid)

final <- rbind(dat1, dat2)

Just replace the csv name with whatever your file is called.

Upvotes: 1

Related Questions