jdcode
jdcode

Reputation: 145

In R, how to multiply multiple rows for each unique entry in a different variable

I have a data frame in R with student grade data. Here's a sample:

df <- data.frame(
  "courseno" = c(101, 101, 101, 236, 236, 236, 101, 101, 101),
  "termID" = c("T", "E", "S", "T", "E", "S", "T", "E", "S"),
  "studentID" = c(9100, 9100, 9100, 8640, 8640, 8640, 3461, 3461, 3461),
  "grade" = c(92, 79, 91, 83, 87, 84, 78, 92, 83))

##        courseno termID studentID grade
## 1           101     T       9100    92
## 2           101     E       9100    79
## 3           101     S       9100    90

## 4           236     T       8640    83
## 5           236     E       8640    87
## 6           236     S       8640    84

## 7           101     T       3461    78
## 8           101     E       3461    92
## 9           101     S       3461    83

For each course, a student earns 3 grades: T (the grade on everything prior to the final exam), E (the grade on the final exam), and S (the ultimate grade in the course). The relationship is S = 0.8T + 0.2E. However, some teachers "bump up" a student's final S grade if the student is, e.g., on the border between an A and B.

For each unique combination of courseno and studentID, I'm trying to calculate the theoretical value of S and compare it to the actual value of S that the teacher awarded. I'm storing the results of these calculations (along with the corresponding courseno & studentID) in a new data frame with 1 line per unique courseno+studentID combination.

I've achieved the desired output, but it's really slow. I feel like there should be a quicker way using dplyr. Here's my current method:

library(dplyr)

# create a code to use for looking up each unique combination of student & course
df <- df %>% mutate(course.student = paste0(courseno, studentID))

# generate master lists that will be populated by the for loop and used to build the new df
courseno.all <- NULL
studentID.all <- NULL
t.awarded.all <- NULL
e.awarded.all <- NULL
s.awarded.all <- NULL
s.calculated.all <- NULL
difference.all <- NULL

for(i in unique(df$course.student)){
  # store info for the ith record in a list so it can be added to the new df
  courseno.i <- df[df$course.student==i & df$termID == "S","courseno"]
  studentID.i <- df[df$course.student==i & df$termID == "S","studentID"]
  t.awarded.i <- df[df$course.student==i & df$termID == "T","grade"]
  e.awarded.i <- df[df$course.student==i & df$termID == "E","grade"]
  s.awarded.i <- df[df$course.student==i & df$termID == "S","grade"]

  # calculate a value of S using the equation S = 0.8T + 0.2E
  s.calculated.i <- round((t.awarded.i*0.8 + e.awarded.i*0.2),0)

  # calculate the amount of "grade bump" the teacher gave
  difference.i <- s.awarded.i - s.calculated.i

  # combine the ith values into master lists with all other i records
  courseno.all <- c(courseno.all, courseno.i)
  studentID.all <- c(studentID.all, studentID.i)
  t.awarded.all <- c(t.awarded.all, t.awarded.i)
  e.awarded.all <- c(e.awarded.all, e.awarded.i)
  s.awarded.all <- c(s.awarded.all, s.awarded.i)
  s.calculated.all <- c(s.calculated.all, s.calculated.i)
  difference.all <- c(difference.all, difference.i)
}

# use the master lists to build a new data frame

df.new <- data.frame(courseno = courseno.all, 
                     studentID = studentID.all, 
                     t.awarded = t.awarded.all, 
                     e.awarded = e.awarded.all, 
                     s.awarded = s.awarded.all, 
                     s.calculated = s.calculated.all,
                     difference = difference.all)

This gives the desired output, which is:

##     courseno   studentID   t.awarded   e.awarded   s.awarded   s.calculated   difference
## 1   101        9100         92         80          91          90             1
## 2   236        8640         83         87          84          84             0
## 3   101        3461         78         92          83          81             2

But my actual data frame is over 100,000 rows and a few dozen columns. Is there a quicker way to achieve this using dplyr and mutate()?

Upvotes: 1

Views: 318

Answers (2)

akrun
akrun

Reputation: 886938

We could use dcast from data.table and it should be efficient for bigger dataset

library(data.table)
dcast(setDT(df), courseno + studentID ~ termID, value.var = 'grade')[,
 S_calculated := 0.8 * T + 0.2 * E][, difference := S - S_calculated][]
#    courseno studentID  E  S  T S_calculated difference
#1:      101      3461 92 83 78         80.8        2.2
#2:      101      9100 79 91 92         89.4        1.6
#3:      236      8640 87 84 83         83.8        0.2

Upvotes: 1

Gregor Thomas
Gregor Thomas

Reputation: 145755

I think this is what you're after:

library(tidyr)
library(dplyr)

df %>% 
  pivot_wider(names_from = termID, values_from = grade) %>%
  mutate(
    S_calculated = 0.8 * T + 0.2 * E,
    difference = S - S_calculated
  )
# # A tibble: 3 x 7
#   courseno studentID     T     E     S S_calculated difference
#      <dbl>     <dbl> <dbl> <dbl> <dbl>        <dbl>      <dbl>
# 1      101      9100    92    79    91         89.4      1.60 
# 2      236      8640    83    87    84         83.8      0.200
# 3      101      3461    78    92    83         80.8      2.20 

Upvotes: 2

Related Questions