Davy

Reputation: 1

Converting a function of nested R for-loops to apply() family functions for efficiency

I have 2 similar data sets.

library(tibble)

d1 <- tribble(
  ~individual, ~X1, ~X2, ~X3,
  "p1", "XX", "XY", "YY",
  "p2", "XY", "XY", "YY",
  "p3", "YY", "XX", "XX"
)

d2 <- tribble(
  ~individual, ~X1, ~X2, ~X3,
  "p1", "XX", "XY", "YY",
  "p2", "XY", "XY", "YY",
  "p3", "YY", "XX", "XX",
  "p4", "YY", "XX", "XX",
  "p5", "YY", "XX", "XX"
)

I made a function to compare d1 to d2. The comparison takes each individual in d1 and compares it to every individual in d2, column by corresponding column. A score is given for each column comparison. Then the mean of the scores for each pair of individuals is returned.
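For illustration, the scoring rule described above can be written as a lookup table. This is only a minimal sketch: the names scoreTable, a and b are hypothetical, and it assumes that XX, XY and YY are the only values that occur.

```r
# Hypothetical lookup table for the scoring rule (rows: d1 value, columns: d2 value).
genotypes <- c("XX", "XY", "YY")
scoreTable <- matrix(
  c(1.0, 0.5, 0.0,
    0.5, 0.5, 0.5,
    0.0, 0.5, 1.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(genotypes, genotypes)
)

# Example: p1 from d1 against p3 from d2, column by column.
a <- c("XX", "XY", "YY")
b <- c("YY", "XX", "XX")

# A two-column character matrix indexes the table by (row name, column name).
pairScores <- scoreTable[cbind(a, b)]
meanScore <- mean(pairScores)
```

Here pairScores holds one score per column and meanScore is the value reported for the pair.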

scoreData <- function(d1, d2) {
 require(tidyverse)  # case_when() comes from dplyr
 output <- data.frame()

 # every column except `individual`
 colNames <- names(d1)[-1]

 for(i in seq_len(nrow(d1))){
  name1 <- d1$individual[i]
  for(j in seq_len(nrow(d2))){
   name2 <- d2$individual[j]
   scores <- NULL
   for(k in seq_along(colNames)){
    col <- colNames[k]
    # extract plain character values rather than 1x1 tibbles
    a <- d1[[col]][i]
    b <- d2[[col]][j]
    score <- case_when(
     a == "XX" && b == "XX" ~ 1.0,
     a == "XX" && b == "XY" ~ 0.5,
     a == "XX" && b == "YY" ~ 0.0,
     a == "YY" && b == "XX" ~ 0.0,
     a == "YY" && b == "XY" ~ 0.5,
     a == "YY" && b == "YY" ~ 1.0,
     a == "XY" && b == "XX" ~ 0.5,
     a == "XY" && b == "XY" ~ 0.5,
     a == "XY" && b == "YY" ~ 0.5
     )
    scores <- append(scores, score)
    }
   meanScore <- mean(scores, na.rm = TRUE)
   # data.frame() keeps meanScore numeric; cbind() would coerce it to character
   output <- rbind(output, data.frame(name1, name2, meanScore))
   }
  }
 return(output)
 }

The problem is that my real datasets are very large and I need to make my code more efficient. I know that the apply() family of functions is more efficient than for loops in R, but I am not sure how to use them to replicate this nested for loop. Eventually, I would like to parallelize the apply functions to make this scoring function even faster. Any ideas or help would be greatly appreciated.

Upvotes: 0

Views: 32

Answers (1)

stefan

Reputation: 123818

One option to avoid the for loops would be to approach your task via a join, which lets you vectorize both the comparisons and the computation of the means:

library(dplyr)

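d1 |>
  # by = NULL gives the cross join: every row of d1 paired with every row of d2
  merge(d2, by = NULL, suffixes = c("", ".y")) |> 
  mutate(across(matches("^X\\d$"), list(score = function(x) {
    # Look up the matching d2 column, e.g. X1.y for X1
    # (cur_data() still works but is deprecated since dplyr 1.1.0)
    y <- cur_data()[[paste0(cur_column(), ".y")]]
    
    case_when(
      x == "XX" & y == "XX" ~ 1,
      x == "XX" & y == "YY" ~ 0,
      x == "YY" & y == "XX" ~ 0,
      x == "YY" & y == "YY" ~ 1,
      TRUE ~ .5  # every pair that involves XY
    )
  }))) |> 
  # Average the per-column scores for each pair of individuals
  mutate(meanScore = rowMeans(across(ends_with("score")))) |> 
  select(name1 = individual, name2 = individual.y, meanScore)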
#> # A tibble: 15 × 3
#>    name1 name2 meanScore
#>    <chr> <chr>     <dbl>
#>  1 p1    p1        0.833
#>  2 p1    p2        0.667
#>  3 p1    p3        0.167
#>  4 p1    p4        0.167
#>  5 p1    p5        0.167
#>  6 p2    p1        0.667
#>  7 p2    p2        0.667
#>  8 p2    p3        0.333
#>  9 p2    p4        0.333
#> 10 p2    p5        0.333
#> 11 p3    p1        0.167
#> 12 p3    p2        0.333
#> 13 p3    p3        1    
#> 14 p3    p4        1    
#> 15 p3    p5        1
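
If the data are very large, a different technique from the join above might be worth trying: encode the values numerically and let a matrix product do all pairwise comparisons at once. With XX = 1, XY = 0 and YY = -1, the score for a single column is 0.5 + 0.5 * a * b, so the mean score for a pair is 0.5 + 0.5 * (dot product) / (number of columns). The following is only a sketch in base R. The helper names encodeGenotypes and scoreMatrix are hypothetical, and it assumes both data sets have the same columns in the same order, contain only XX, XY and YY, and have no missing values. The sample data are rebuilt with data.frame() so the block runs on its own.

```r
# Sample data from the question, rebuilt as plain data frames.
d1 <- data.frame(
  individual = c("p1", "p2", "p3"),
  X1 = c("XX", "XY", "YY"),
  X2 = c("XY", "XY", "XX"),
  X3 = c("YY", "YY", "XX")
)
d2 <- data.frame(
  individual = c("p1", "p2", "p3", "p4", "p5"),
  X1 = c("XX", "XY", "YY", "YY", "YY"),
  X2 = c("XY", "XY", "XX", "XX", "XX"),
  X3 = c("YY", "YY", "XX", "XX", "XX")
)

# Hypothetical helper: turn the genotype columns into a numeric matrix
# with one row per individual (XX = 1, XY = 0, YY = -1).
encodeGenotypes <- function(d) {
  codes <- c(XX = 1, XY = 0, YY = -1)
  m <- as.matrix(d[, -1])  # drop the `individual` column
  matrix(unname(codes[m]), nrow = nrow(m),
         dimnames = list(d$individual, colnames(m)))
}

# Hypothetical helper: mean score for every (d1 row, d2 row) pair.
scoreMatrix <- function(d1, d2) {
  a <- encodeGenotypes(d1)
  b <- encodeGenotypes(d2)
  # tcrossprod(a, b) is a %*% t(b): one dot product per pair of individuals
  0.5 + 0.5 * tcrossprod(a, b) / ncol(a)
}

m <- scoreMatrix(d1, d2)

# Reshape to the long name1 / name2 / meanScore layout (column-major order).
result <- data.frame(
  name1 = rep(rownames(m), times = ncol(m)),
  name2 = rep(colnames(m), each = nrow(m)),
  meanScore = as.vector(m)
)
```

On the sample data this gives the same mean scores as the join-based version, although the rows of result come out in a different order. Because the whole computation is a single matrix product, it avoids building the full cross join in memory as a data frame.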

Upvotes: 1
