Reputation: 1
I have two similar data sets:
d1 <- tribble(
~individual, ~X1, ~X2, ~X3,
"p1", "XX", "XY", "YY",
"p2", "XY", "XY", "YY",
"p3", "YY", "XX", "XX"
)
d2 <- tribble(
~individual, ~X1, ~X2, ~X3,
"p1", "XX", "XY", "YY",
"p2", "XY", "XY", "YY",
"p3", "YY", "XX", "XX",
"p4", "YY", "XX", "XX",
"p5", "YY", "XX", "XX"
)
I wrote a function to compare d1 to d2. It takes each individual in d1 and compares it to every individual in d2, column by corresponding column. Each column comparison gets a score, and the mean of those scores is returned for each pair of individuals.
library(tidyverse)  # provides case_when()
scoreData <- function(d1, d2) {
  output <- data.frame()
  colNames <- names(d1)[-1]  # every column except `individual`
  for (i in seq_len(nrow(d1))) {
    name1 <- d1$individual[i]
    for (j in seq_len(nrow(d2))) {
      name2 <- d2$individual[j]
      scores <- NULL
      for (k in seq_along(colNames)) {
        col <- colNames[k]
        a <- d1[[col]][i]  # value for individual i in d1
        b <- d2[[col]][j]  # value for individual j in d2
        score <- case_when(
          a == "XX" && b == "XX" ~ 1.0,
          a == "XX" && b == "XY" ~ 0.5,
          a == "XX" && b == "YY" ~ 0.0,
          a == "YY" && b == "XX" ~ 0.0,
          a == "YY" && b == "XY" ~ 0.5,
          a == "YY" && b == "YY" ~ 1.0,
          a == "XY" && b == "XX" ~ 0.5,
          a == "XY" && b == "XY" ~ 0.5,
          a == "XY" && b == "YY" ~ 0.5
        )
        scores <- append(scores, score)
      }
      meanScore <- mean(scores, na.rm = TRUE)
      # data.frame() keeps meanScore numeric; cbind() would coerce it to character
      output <- rbind(output, data.frame(name1, name2, meanScore))
    }
  }
  return(output)
}
The problem is that my real datasets are very large, so I need to make my code more efficient. I understand that the apply() family of functions can be more efficient than for loops in R, but I am not sure how to use them to replicate this nested for loop. Eventually, I would like to parallelize the apply functions to speed up the scoring further. Any ideas or help would be greatly appreciated.
Upvotes: 0
Views: 32
Reputation: 123818
One option to avoid the for loops is to approach the task via a cross join, which lets you vectorize both the comparisons and the computation of the means:
library(dplyr)
d1 |>
  # by = NULL makes merge() return the Cartesian product of d1 and d2
  merge(d2, by = NULL, suffixes = c("", ".y")) |>
  mutate(across(matches("^X\\d$"), list(score = function(x) {
    # look up the matching d2 column, which carries the ".y" suffix
    y <- cur_data()[[paste0(cur_column(), ".y")]]
    case_when(
      x == "XX" & y == "XX" ~ 1,
      x == "XX" & y == "YY" ~ 0,
      x == "YY" & y == "XX" ~ 0,
      x == "YY" & y == "YY" ~ 1,
      TRUE ~ .5  # any pair involving "XY"
    )
  }))) |>
  mutate(meanScore = rowMeans(across(ends_with("score")))) |>
  select(name1 = individual, name2 = individual.y, meanScore)
#> # A tibble: 15 × 3
#> name1 name2 meanScore
#> <chr> <chr> <dbl>
#> 1 p1 p1 0.833
#> 2 p1 p2 0.667
#> 3 p1 p3 0.167
#> 4 p1 p4 0.167
#> 5 p1 p5 0.167
#> 6 p2 p1 0.667
#> 7 p2 p2 0.667
#> 8 p2 p3 0.333
#> 9 p2 p4 0.333
#> 10 p2 p5 0.333
#> 11 p3 p1 0.167
#> 12 p3 p2 0.333
#> 13 p3 p3 1
#> 14 p3 p4 1
#> 15 p3 p5 1
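If the cross-joined table gets too large to hold in memory, another possibility is to stay in base R and build the score matrix one column at a time with outer(). The following is only a rough sketch, not a tested drop-in for the real data: score_pair() and score_matrix() are hypothetical helper names, the example data are rebuilt as plain data frames, and it assumes every value is one of "XX", "XY" or "YY" with no missing values.

```r
d1 <- data.frame(
  individual = c("p1", "p2", "p3"),
  X1 = c("XX", "XY", "YY"),
  X2 = c("XY", "XY", "XX"),
  X3 = c("YY", "YY", "XX")
)
d2 <- data.frame(
  individual = c("p1", "p2", "p3", "p4", "p5"),
  X1 = c("XX", "XY", "YY", "YY", "YY"),
  X2 = c("XY", "XY", "XX", "XX", "XX"),
  X3 = c("YY", "YY", "XX", "XX", "XX")
)

# Hypothetical helper: element-wise score for two character vectors.
# Any pair involving "XY" scores 0.5; otherwise 1 if equal, 0 if not.
# Assumes no NA values.
score_pair <- function(a, b) {
  ifelse(a == "XY" | b == "XY", 0.5, as.numeric(a == b))
}

# Hypothetical helper: mean score for every (d1 row, d2 row) pair,
# returned as an nrow(d1) x nrow(d2) matrix.
score_matrix <- function(d1, d2, id = "individual") {
  cols <- setdiff(names(d1), id)
  total <- matrix(0, nrow(d1), nrow(d2),
                  dimnames = list(d1[[id]], d2[[id]]))
  for (col in cols) {
    # outer() applies score_pair() to all row pairs of this column at once
    total <- total + outer(d1[[col]], d2[[col]], score_pair)
  }
  total / length(cols)
}

m <- score_matrix(d1, d2)

# Optional: reshape to the long name1 / name2 / meanScore layout.
# as.vector() walks the matrix column by column, so name1 varies fastest.
res <- data.frame(
  name1 = rep(rownames(m), times = ncol(m)),
  name2 = rep(colnames(m), each = nrow(m)),
  meanScore = as.vector(m)
)
```

The only remaining loop runs over the columns, not over pairs of individuals, so its cost in R-level iterations does not grow with the number of rows. For parallel use, one option might be to split d1 into row chunks and call score_matrix() on each chunk, for example with parallel::mclapply(), then rbind() the resulting matrices.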
Upvotes: 1