Reputation: 9
I have a data frame arranged along these lines:
theta | Rater | Case1 | Case2 | ... | CaseN |
---|---|---|---|---|---|
theta1 | rater1 | score1 | score2 | ... | scoreN |
theta2 | rater1 | score1 | score4 | ... | scoreN |
theta1 | rater2 | score1 | score2 | ... | scoreN |
... | ... | ... | ... | ... | ... |
thetaN | raterN | score1 | score2 | ... | scoreN |
I need to correlate each rater's collection of scores with the same rater's collection of thetas. The problem is that thetas vary by row, scores vary by column AND row, and raters can appear in multiple rows. That is, each rater has multiple scores (spread across columns AND rows), and each rater is also associated with multiple thetas.
df <- data.frame(theta = c(2.04363195930878, 2.04363195930878, 2.92270245709173,
                           2.92270245709173, 0.200236377489036),
                 Rater = c(77, 71, 58, 21, 86),
                 `Case1` = c(NA, NA, 3, 3, NA),
                 `Case2` = c(NA, NA, NA, NA, 1),
                 # ... (remaining Case columns elided)
                 `CaseN` = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_))
I've tried a couple of different strategies to manipulate the data into the form I need (essentially, a vector of thetas and a vector of scores for each rater), but I keep confusing myself and getting nowhere.
Upvotes: 0
Views: 56
Reputation: 69221
I'm not sure I fully followed, but I believe the following achieves what you are after, using melt() from data.table and then grouping with by:
#illustrative data
set.seed(42)
dat <- data.frame(theta = rnorm(6),
Rater = c("Rater1", "Rater1", "Rater2", "Rater2", "Rater3", "Rater3"),
Case1 = rnorm(6),
Case2 = rnorm(6),
Case3 = rnorm(6)
)
#convert to data.table
library(data.table)
setDT(dat)
#melt into longer format keeping integrity of the theta scores and Rater
dat_melt <- melt(dat, id.vars = 1:2)
#calculate correlation between theta and value (the column name for the long-form of Cases)
dat_melt[, .(cor = cor(theta, value)), by = Rater]
Which will give you an output of:
Rater cor
<char> <num>
1: Rater1 -0.3898634362
2: Rater2 -0.4087852545
3: Rater3 0.0004029269
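One caveat: the data in the question contains many NA scores, and cor() returns NA as soon as one of its inputs has a missing value. A minimal sketch of one way to handle that, assuming it is acceptable to simply drop the missing scores, is to pass na.rm = TRUE to melt(). The toy values and the column names Case and score below are made up for illustration:

```r
library(data.table)

# Toy data with missing scores, shaped like the question's data frame
# (values are invented for illustration)
dat_na <- data.table(
  theta = c(0.5, 1.2, -0.3, 2.0, 0.1, 1.5),
  Rater = c("A", "A", "A", "B", "B", "B"),
  Case1 = c(1, NA, 3, 2, NA, 4),
  Case2 = c(NA, 2, NA, NA, 1, 3)
)

# na.rm = TRUE drops the rows whose score is NA while melting
long_na <- melt(dat_na, id.vars = c("theta", "Rater"),
                variable.name = "Case", value.name = "score",
                na.rm = TRUE)

# One correlation per rater, pooled over all of that rater's (theta, score) pairs;
# raters with fewer than two pairs get NA instead of a warning
res_na <- long_na[, .(n = .N,
                      cor = if (.N >= 2) cor(theta, score) else NA_real_),
                  by = Rater]
res_na
```

The n column is only there so you can see how many pairs each correlation is based on; with data as sparse as in the question, some raters may end up with very few.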
Upvotes: 0
Reputation: 20399
You could use summarize() from dplyr and group_by() "rater". across() helps you to calculate the correlation for each case:
library(dplyr)
set.seed(20241104)
cases_n <- 10L
m <- 100L
rater_n <- 5L
(df <- rnorm((cases_n + 1L) * m) %>%
matrix(nrow = m) %>%
as.data.frame() %>%
setNames(c("theta", paste("case", 1:cases_n, sep = "_"))) %>%
as_tibble() %>%
mutate(rater = sample(paste("rater", 1:rater_n), m, TRUE), .before = 1L))
# # A tibble: 100 × 12
# rater theta case_1 case_2 case_3 case_4 case_5 case_6 case_7 case_8
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 rater 3 -0.478 -1.62 -1.96 -3.16 0.816 -0.719 -0.0381 0.187 -0.392
# 2 rater 2 -0.467 0.349 2.89 0.847 1.85 -0.310 0.546 -0.612 0.171
# 3 rater 2 -1.05 0.270 0.209 2.30 -0.400 -0.0329 0.532 -0.276 -0.391
# 4 rater 5 -0.136 -0.725 0.269 -0.206 -0.757 -1.81 -1.30 -0.900 -0.294
# 5 rater 4 0.248 -1.49 -0.0748 -0.418 1.33 0.459 -1.32 -0.230 -0.662
# 6 rater 5 1.75 -0.729 -0.950 0.862 0.0381 -1.52 0.534 0.0627 1.21
# 7 rater 2 -2.12 -2.79 -0.137 -1.09 1.03 -1.10 -1.19 -1.02 0.290
# 8 rater 4 1.00 0.855 -0.476 0.841 -1.22 -0.841 -1.41 -0.816 -1.18
# 9 rater 1 -0.262 1.00 -0.100 -1.36 0.202 0.523 0.121 -1.24 -0.800
# 10 rater 3 0.409 0.640 1.95 0.763 -0.290 -0.534 0.540 0.446 -1.53
# # ℹ 90 more rows
# # ℹ 2 more variables: case_9 <dbl>, case_10 <dbl>
df %>%
group_by(rater) %>%
summarize(across(starts_with("case"), ~ cor(theta, .x)))
# # A tibble: 5 × 11
# rater case_1 case_2 case_3 case_4 case_5 case_6 case_7 case_8 case_9
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 rater 1 -0.0177 -0.130 -0.0698 -0.0187 0.139 0.0834 -0.127 0.262 0.00358
# 2 rater 2 0.630 0.308 -0.134 0.0220 0.298 0.633 -0.383 -0.491 0.00557
# 3 rater 3 0.217 0.399 0.0253 -0.158 -0.357 0.166 0.0546 0.233 0.0639
# 4 rater 4 0.303 -0.237 0.0498 -0.200 0.0255 0.00624 0.182 -0.359 0.0832
# 5 rater 5 -0.0175 0.230 0.102 0.0774 -0.0120 -0.393 -0.145 0.274 0.230
# # ℹ 1 more variable: case_10 <dbl>
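The result above has one correlation per rater and case. If a single correlation per rater is wanted instead, pooled over all cases as the question seems to ask, one possible sketch is to reshape to long format first. This assumes tidyr is available in addition to dplyr and that missing scores can be dropped; the toy values and the column names case and score are made up for illustration:

```r
library(dplyr)
library(tidyr)

# Toy data with missing scores (values are invented for illustration)
toy <- tibble(
  rater  = c("A", "A", "A", "B", "B", "B"),
  theta  = c(0.5, 1.2, -0.3, 2.0, 0.1, 1.5),
  case_1 = c(1, NA, 3, 2, NA, 4),
  case_2 = c(NA, 2, NA, NA, 1, 3)
)

pooled <- toy %>%
  # one row per (rater, theta, case) with a non-missing score
  pivot_longer(starts_with("case"), names_to = "case",
               values_to = "score", values_drop_na = TRUE) %>%
  group_by(rater) %>%
  # pooled correlation over all of the rater's (theta, score) pairs
  summarize(n = n(), cor = cor(theta, score))
pooled
```

Each theta is repeated once per non-missing score in its row, so every score stays paired with the theta from the same row.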
Upvotes: 1