Reputation: 145
I have a data frame in R with student grade data. Here's a sample:
df <- data.frame(
"courseno" = c(101, 101, 101, 236, 236, 236, 101, 101, 101),
"termID" = c("T", "E", "S", "T", "E", "S", "T", "E", "S"),
"studentID" = c(9100, 9100, 9100, 8640, 8640, 8640, 3461, 3461, 3461),
"grade" = c(92, 79, 91, 83, 87, 84, 78, 92, 83))
## courseno termID studentID grade
## 1 101 T 9100 92
## 2 101 E 9100 79
## 3 101 S 9100 90
## 4 236 T 8640 83
## 5 236 E 8640 87
## 6 236 S 8640 84
## 7 101 T 3461 78
## 8 101 E 3461 92
## 9 101 S 3461 83
For each course, a student earns 3 grades: T (the grade on everything prior to the final exam), E (the grade on the final exam), and S (the ultimate grade in the course). The relationship is S = 0.8T + 0.2E. However, some teachers "bump up" a student's final S grade if the student is, e.g., on the border between an A and B.
For each unique combination of courseno and studentID, I'm trying to calculate the theoretical value of S and compare it to the actual value of S that the teacher awarded. I'm storing the results of these calculations (along with the corresponding courseno & studentID) in a new data frame with 1 line per unique courseno+studentID combination.
I've achieved the desired output, but it's really slow. I feel like there should be a quicker way using dplyr. Here's my current method:
library(dplyr)
# create a code to use for looking up each unique combination of student & course
df <- df %>% mutate(course.student = paste0(courseno, studentID))
# generate master lists that will be populated by the for loop and used to build the new df
courseno.all <- NULL
studentID.all <- NULL
t.awarded.all <- NULL
e.awarded.all <- NULL
s.awarded.all <- NULL
s.calculated.all <- NULL
difference.all <- NULL
for(i in unique(df$course.student)){
# store info for the ith record in a list so it can be added to the new df
courseno.i <- df[df$course.student==i & df$termID == "S","courseno"]
studentID.i <- df[df$course.student==i & df$termID == "S","studentID"]
t.awarded.i <- df[df$course.student==i & df$termID == "T","grade"]
e.awarded.i <- df[df$course.student==i & df$termID == "E","grade"]
s.awarded.i <- df[df$course.student==i & df$termID == "S","grade"]
# calculate a value of S using the equation S = 0.8T + 0.2E
s.calculated.i <- round((t.awarded.i*0.8 + e.awarded.i*0.2),0)
# calculate the amount of "grade bump" the teacher gave
difference.i <- s.awarded.i - s.calculated.i
# combine the ith values into master lists with all other i records
courseno.all <- c(courseno.all, courseno.i)
studentID.all <- c(studentID.all, studentID.i)
t.awarded.all <- c(t.awarded.all, t.awarded.i)
e.awarded.all <- c(e.awarded.all, e.awarded.i)
s.awarded.all <- c(s.awarded.all, s.awarded.i)
s.calculated.all <- c(s.calculated.all, s.calculated.i)
difference.all <- c(difference.all, difference.i)
}
# use the master lists to build a new data frame
df.new <- data.frame(courseno = courseno.all,
studentID = studentID.all,
t.awarded = t.awarded.all,
e.awarded = e.awarded.all,
s.awarded = s.awarded.all,
s.calculated = s.calculated.all,
difference = difference.all)
This gives the desired output, which is:
## courseno studentID t.awarded e.awarded s.awarded s.calculated difference
## 1 101 9100 92 80 91 90 1
## 2 236 8640 83 87 84 84 0
## 3 101 3461 78 92 83 81 2
But my actual data frame is over 100,000 rows and a few dozen columns. Is there a quicker way to achieve this using dplyr and mutate()?
Upvotes: 1
Views: 318
Reputation: 886938
We could use dcast
from data.table
and it should be efficient for bigger dataset
library(data.table)
dcast(setDT(df), courseno + studentID ~ termID, value.var = 'grade')[,
S_calculated := 0.8 * T + 0.2 * E][, difference := S - S_calculated][]
# courseno studentID E S T S_calculated difference
#1: 101 3461 92 83 78 80.8 2.2
#2: 101 9100 79 91 92 89.4 1.6
#3: 236 8640 87 84 83 83.8 0.2
Upvotes: 1
Reputation: 145755
I think this is what you're after:
library(tidyr)
library(dplyr)
df %>%
pivot_wider(names_from = termID, values_from = grade) %>%
mutate(
S_calculated = 0.8 * T + 0.2 * E,
difference = S - S_calculated
)
# # A tibble: 3 x 7
# courseno studentID T E S S_calculated difference
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 101 9100 92 79 91 89.4 1.60
# 2 236 8640 83 87 84 83.8 0.200
# 3 101 3461 78 92 83 80.8 2.20
Upvotes: 2