Reputation: 9

Correlation based on multiple columns and rows

I have a data frame arranged along these lines:

theta	Rater	Case1	Case2	...	CaseN
theta1	rater1	score1	score2	...	scoreN
theta2	rater1	score1	score4	...	scoreN
theta1	rater2	score1	score2	...	scoreN
...	...	...	...	...	...
thetaN	raterN	score1	score2	...	scoreN

I need to correlate each rater's collection of scores with the same rater's collection of thetas. The problem is that thetas vary by row, scores vary by column AND row, and raters can appear in multiple rows. That is, each rater has multiple scores (spread across columns AND rows), and each rater is also associated with multiple thetas.

df <- data.frame(theta = c(2.04363195930878, 2.04363195930878, 2.92270245709173,
                           2.92270245709173, 0.200236377489036),
                 Rater = c(77, 71, 58, 21, 86),
                 Case1 = c(NA, NA, 3, 3, NA),
                 Case2 = c(NA, NA, NA, NA, 1),
                 # ... (further Case columns elided)
                 CaseN = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_))

I've tried a couple of different strategies to manipulate the data into the form I need (essentially, a vector of thetas and a vector of scores for each rater), but I keep confusing myself and getting nowhere.

Upvotes: 0

Answers (2)

Chase

Reputation: 69221

I'm not sure I fully followed, but I believe the following does what you are after, using melt from data.table to reshape to long format and then grouping with by:

#illustrative data
set.seed(42)
dat <- data.frame(theta = rnorm(6),
                  Rater = c("Rater1", "Rater1", "Rater2", "Rater2", "Rater3", "Rater3"),
                  Case1 = rnorm(6),
                  Case2 = rnorm(6),
                  Case3 = rnorm(6)
)

#convert to data.table
library(data.table)
setDT(dat)

#melt into longer format keeping integrity of the theta scores and Rater
dat_melt <- melt(dat, id.vars = 1:2)

#calculate correlation between theta and value (the column name for the long-form of Cases)
dat_melt[, .(cor = cor(theta, value)), by = Rater]

Which will give you an output of:

   Rater           cor
   <char>         <num>
1: Rater1 -0.3898634362
2: Rater2 -0.4087852545
3: Rater3  0.0004029269
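One caveat: the question's data contain many NA scores, and cor() returns NA as soon as any pair has a missing value. A minimal sketch of one way around that, assuming that unscored cases should simply be dropped: pass na.rm = TRUE to melt so only observed theta/score pairs remain. The toy data and the column names Case and score are made up for illustration.

```r
library(data.table)

# Hypothetical toy data shaped like the question's:
# each row (one theta) has scores for only some of the cases
dat <- data.table(
  theta = c(1, 2, 3, 1, 2, 3),
  Rater = c("a", "a", "a", "b", "b", "b"),
  Case1 = c(1, NA, 3, 3, NA, 1),
  Case2 = c(NA, 2, NA, NA, 2, NA)
)

# na.rm = TRUE drops the rows whose score is NA while melting,
# leaving one row per observed (theta, score) pair
long <- melt(dat, id.vars = c("theta", "Rater"),
             variable.name = "Case", value.name = "score",
             na.rm = TRUE)

# one correlation per rater, pooled over all of that rater's cases;
# n is the number of pairs that went into each correlation
res <- long[, .(n = .N, cor = cor(theta, score)), by = Rater]
print(res)
```

Keeping n alongside the correlation makes it easy to spot raters with too few scored cases for the estimate to mean much.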

Upvotes: 0

thothal

Reputation: 20399

You could use summarize from dplyr together with group_by on rater. across then calculates the correlation with theta separately for each case column:

library(dplyr)

set.seed(20241104)

cases_n <- 10L
m <- 100L
rater_n <- 5L

(df <- rnorm((cases_n + 1L) * m) %>%
  matrix(nrow = m) %>%
  as.data.frame() %>%
  setNames(c("theta", paste("case", 1:cases_n, sep = "_"))) %>%
  as_tibble() %>%
  mutate(rater = sample(paste("rater", 1:rater_n), m, TRUE), .before = 1L))
# # A tibble: 100 × 12
#    rater    theta case_1  case_2 case_3  case_4  case_5  case_6  case_7 case_8
#    <chr>    <dbl>  <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
#  1 rater 3 -0.478 -1.62  -1.96   -3.16   0.816  -0.719  -0.0381  0.187  -0.392
#  2 rater 2 -0.467  0.349  2.89    0.847  1.85   -0.310   0.546  -0.612   0.171
#  3 rater 2 -1.05   0.270  0.209   2.30  -0.400  -0.0329  0.532  -0.276  -0.391
#  4 rater 5 -0.136 -0.725  0.269  -0.206 -0.757  -1.81   -1.30   -0.900  -0.294
#  5 rater 4  0.248 -1.49  -0.0748 -0.418  1.33    0.459  -1.32   -0.230  -0.662
#  6 rater 5  1.75  -0.729 -0.950   0.862  0.0381 -1.52    0.534   0.0627  1.21 
#  7 rater 2 -2.12  -2.79  -0.137  -1.09   1.03   -1.10   -1.19   -1.02    0.290
#  8 rater 4  1.00   0.855 -0.476   0.841 -1.22   -0.841  -1.41   -0.816  -1.18 
#  9 rater 1 -0.262  1.00  -0.100  -1.36   0.202   0.523   0.121  -1.24   -0.800
# 10 rater 3  0.409  0.640  1.95    0.763 -0.290  -0.534   0.540   0.446  -1.53 
# # ℹ 90 more rows
# # ℹ 2 more variables: case_9 <dbl>, case_10 <dbl>

df %>%
  group_by(rater) %>%
  summarize(across(starts_with("case"), ~ cor(theta, .x)))

# # A tibble: 5 × 11
#   rater    case_1 case_2  case_3  case_4  case_5   case_6  case_7 case_8  case_9
#   <chr>     <dbl>  <dbl>   <dbl>   <dbl>   <dbl>    <dbl>   <dbl>  <dbl>   <dbl>
# 1 rater 1 -0.0177 -0.130 -0.0698 -0.0187  0.139   0.0834  -0.127   0.262 0.00358
# 2 rater 2  0.630   0.308 -0.134   0.0220  0.298   0.633   -0.383  -0.491 0.00557
# 3 rater 3  0.217   0.399  0.0253 -0.158  -0.357   0.166    0.0546  0.233 0.0639 
# 4 rater 4  0.303  -0.237  0.0498 -0.200   0.0255  0.00624  0.182  -0.359 0.0832 
# 5 rater 5 -0.0175  0.230  0.102   0.0774 -0.0120 -0.393   -0.145   0.274 0.230  
# # ℹ 1 more variable: case_10 <dbl>
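The result above has one correlation per rater and case. If a single correlation per rater, pooled over all cases, is wanted instead (which is how I read the question), a minimal sketch is to reshape to long format first. This assumes tidyr is available and that missing scores should be dropped; the toy data and the names case and score are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: some cases are unscored (NA) in each row
toy <- tibble(
  rater  = c("r1", "r1", "r1", "r2", "r2", "r2"),
  theta  = c(1, 2, 3, 1, 2, 3),
  case_1 = c(1, NA, 3, 3, NA, 1),
  case_2 = c(NA, 2, NA, NA, 2, NA)
)

pooled <- toy %>%
  # one row per (rater, theta, case); rows with NA scores are dropped
  pivot_longer(starts_with("case"), names_to = "case", values_to = "score",
               values_drop_na = TRUE) %>%
  group_by(rater) %>%
  # pooled correlation over all of a rater's observed pairs
  summarize(n = n(), cor = cor(theta, score))

print(pooled)
```

An alternative that keeps the wide layout is cor(theta, .x, use = "pairwise.complete.obs") inside across, but that still yields one value per case rather than one per rater.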

Upvotes: 1
