Luis
Luis

Reputation: 1584

Using tidyverse to dynamically mutate one variable from one grouped dataset from another dataset

Let's say I work with different classes (nodes, in my dataset) and I have thousands of students. Each student has their own math score, and I need to compare all individual scores with the group mean/sd. To deal with that, I have two different datasets. The first one is "a table". default table

This data frame is formed of several classes (nodes), their means, and their sd.

I also have another dataset composed of students' results, like this one:

students' results

I want to have another dataset in which I get all individual results (i.e., 11, 6, 10, etc) and subtract this result from all means in the first dataset. In the future, it will be needed to check all results and all nodes together.

In other words, from the first nome (number 12 in the image), I will subtract 11 (student's result) from 68 (mean result), 6 (student's result) from 68 (mean), 10 (student's result) from 68 (mean), etc. Then I'll move for the second node (number 7 in the image), and I will do the same (subtract 11 from 74 (mean result), 6 from 74 (mean result), 10 from 74),

The final output I would like to have is below. Please, take this image as illustrative-only: Final output

Due to questions, to make my question clear, the following command also brings an interesting solution, but it's not functional since I'll have to manually add all columns to the new dataset:

test_result %>% 
  mutate(total_header_node_12 = x - default_table$t_mean[1]) %>% 
  mutate(total_header_node_74 = x - default_table$t_mean[2]) 

Another desired output

I searched for other questions but I did not find any solution. Any help is valuable. I use tidyverse, and I would like to remain within tidyverse environment. Thank you To reproduce:

> dput(default_table)
structure(list(node = structure(c(6L, 3L, 5L, 1L, 2L, 4L, 7L), .Label = c("4", 
"5", "7", "8", "10", "12", "13"), class = "factor"), t_mean = c(68.8219178082192, 
74.3260869565217, 83.0178571428571, 92.2108108108108, 98.3304347826087, 
88.6111111111111, 48.4), t_sd = c(14.4351088961341, 16.9448394654941, 
13.0272663858681, 12.2011483603603, 12.1775472144027, 14.5621088567959, 
10.4876948807826), vars = c(1, 1, 1, 1, 1, 1, 1), n = c(121, 
74, 92, 616, 191, 58, 7), mean = c(68, 74.6891891891892, 82.8369565217391, 
91.3944805194805, 97.738219895288, 88.0172413793103, 48.7142857142857
), sd = c(14.0226008048911, 16.1151045250761, 11.0426517498479, 
12.6758935948866, 12.0212336250146, 15.9169901273025, 8.63547500554709
), min = c(32, 32, 58, 36, 56, 44, 39), max = c(97, 113, 104, 
123, 128, 124, 60), range = c(65, 81, 46, 87, 72, 80, 21), se = c(1.27478189135374, 
1.87334284914993, 1.15127602962793, 0.510726307415094, 0.869825937534791, 
2.09000319547951, 3.26390275965596), q0_25 = c(59, 64, 74.75, 
84, 90, 80, 41.5), q0_5 = c(68, 73.5, 81.5, 92, 98, 87, 47), 
    q0_75 = c(80, 87.75, 92.25, 100, 106, 98.75, 56)), class = "data.frame", row.names = c(NA, 
-7L))


test_result <- data.frame(x = rnorm(100,10,2))

Upvotes: 0

Views: 115

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 389275

Probably, you can try :

library(dplyr)

default_table %>%
  tidyr::crossing(test_result) %>%
  mutate(comparative_mean = x - t_mean)

This creates all combinations of default_table with test_result and then we subtract the two values from each other.

Upvotes: 2

Calum You
Calum You

Reputation: 15072

You just need to add the student results as a new column and then unnest to get a long form data that you can do this comparison on. There is probably also a solution with crossing but this seemed simpler. Essentially, we want to insert the student results as a new column in the default_table, and then use unnest to expand so there is one row per node-student, as in the desired output. Then we can simply subtract the mean from the student results.

set.seed(1)
library(tidyverse)
default_table <- structure(list(node = structure(c(6L, 3L, 5L, 1L, 2L, 4L, 7L), .Label = c("4", "5", "7", "8", "10", "12", "13"), class = "factor"), t_mean = c(68.8219178082192, 74.3260869565217, 83.0178571428571, 92.2108108108108, 98.3304347826087, 88.6111111111111, 48.4), t_sd = c(14.4351088961341, 16.9448394654941, 13.0272663858681, 12.2011483603603, 12.1775472144027, 14.5621088567959, 10.4876948807826), vars = c(1, 1, 1, 1, 1, 1, 1), n = c(121, 74, 92, 616, 191, 58, 7), mean = c(68, 74.6891891891892, 82.8369565217391, 91.3944805194805, 97.738219895288, 88.0172413793103, 48.7142857142857), sd = c(14.0226008048911, 16.1151045250761, 11.0426517498479, 12.6758935948866, 12.0212336250146, 15.9169901273025, 8.63547500554709), min = c(32, 32, 58, 36, 56, 44, 39), max = c(97, 113, 104, 123, 128, 124, 60), range = c(65, 81, 46, 87, 72, 80, 21), se = c(1.27478189135374, 1.87334284914993, 1.15127602962793, 0.510726307415094, 0.869825937534791, 2.09000319547951, 3.26390275965596), q0_25 = c(59, 64, 74.75, 84, 90, 80, 41.5), q0_5 = c(68, 73.5, 81.5, 92, 98, 87, 47), q0_75 = c(80, 87.75, 92.25, 100, 106, 98.75, 56)), class = "data.frame", row.names = c(NA, -7L))
test_result <- data.frame(x = rnorm(10,10,2))

test_result <- test_result %>%
  rowid_to_column(var = "student_number") %>%
  rename(student_result = x)

default_table %>%
  mutate(students = map(node, ~ test_result)) %>%
  unnest(students) %>%
  mutate(comparative_mean = student_result - mean) %>%
  select(node, mean, student_number, student_result, comparative_mean)
#> # A tibble: 70 x 5
#>    node   mean student_number student_result comparative_mean
#>    <fct> <dbl>          <int>          <dbl>            <dbl>
#>  1 12       68              1           8.75            -59.3
#>  2 12       68              2          10.4             -57.6
#>  3 12       68              3           8.33            -59.7
#>  4 12       68              4          13.2             -54.8
#>  5 12       68              5          10.7             -57.3
#>  6 12       68              6           8.36            -59.6
#>  7 12       68              7          11.0             -57.0
#>  8 12       68              8          11.5             -56.5
#>  9 12       68              9          11.2             -56.8
#> 10 12       68             10           9.39            -58.6
#> # … with 60 more rows

Created on 2020-02-19 by the reprex package (v0.3.0)

Upvotes: 1

Related Questions