Reputation: 23

How can I create a for loop in R to iterate through a list of column names across two data frames and output match/no match for each sample?

I have two data sets with the same column names and I am trying to compare the data across the columns based upon sample ID. For example:

  df1 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                  loci1 = c('T,T', 'A,T', 'C,T', 'T,T', 'T,G'),
                  loci2 = c('G,T', 'T,T', 'A,T', 'T,T', 'T,A'))
  df2 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                  loci1 = c('T,T', 'A,T', 'C,T', 'A,A', 'C,G'),
                  loci2 = c('T,T', 'T,A', 'A,T', 'T,G', 'T,A'))

I'd like to write code that checks whether df1$loci1 is the same as df2$loci1 for animal1, and so on, since I have 200 animals and over 70 loci to iterate through. Ideally it would create a new column for each locus indicating "match" or "no match".

I've begun by joining the two data frames and then using mutate to create a new column which outputs if loci1 matches for both data frames:

  df3 <- df1 %>%
    inner_join(df2, by = 'sample_ID') %>%
    mutate(match_loci1 = c('no_match', 'match')[1 + (loci1.x == loci1.y)])

This works well for one locus, but since I have quite a few to get through, I am hoping for help developing a for loop that iterates through the different loci (now labeled loci$.x and loci$.y after the inner_join) and creates a new column for each: "match_loci1", "match_loci2", etc.

I've gotten as far as creating a list of all loci and initiating the for loop:

loci_names <- colnames(df1)

  test2 <- df1 %>% 
  inner_join(df2, by = 'sample_ID') %>%
  for (i in loci_list) {
    mutate(match$[[i]] = c('no_match', 'match')[1 + [[i]]$.x == [[i]]$.y])
  }

but I get this error:

Error: unexpected '[[' in: " for (i in loci_list) { mutate(match$[["

I am not sure how to format the mutate call so that it iterates through each locus.

Upvotes: 2

Answers (3)

jay.sf

Reputation: 73572

Use merge(), then test the columns selected by grep() for equality: those ending in .x against those ending in .y.

> mg <- merge(df1, df2, by='sample_ID') 
> cbind(mg, match=mg[grep('\\.x$', names(mg))] == mg[grep('\\.y$', names(mg))])
  sample_ID loci1.x loci2.x loci1.y loci2.y match.loci1.x match.loci2.x
1   animal1     T,T     G,T     T,T     T,T          TRUE         FALSE
2   animal2     A,T     T,T     A,T     T,A          TRUE         FALSE
3   animal3     C,T     A,T     C,T     A,T          TRUE          TRUE
4   animal4     T,T     T,T     A,A     T,G         FALSE         FALSE
5   animal5     T,G     T,A     C,G     T,A         FALSE          TRUE
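
If "match"/"no_match" labels and columns named match_loci1, match_loci2, and so on are preferred over TRUE/FALSE, the same base R idea can be extended roughly as follows. This is only a sketch: the helper names (x_cols, y_cols, is_match, labels, result) are made up for illustration, and it pairs each .x column with its .y partner by name rather than by position as above.

```r
df1 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                  loci1 = c('T,T', 'A,T', 'C,T', 'T,T', 'T,G'),
                  loci2 = c('G,T', 'T,T', 'A,T', 'T,T', 'T,A'))
df2 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                  loci1 = c('T,T', 'A,T', 'C,T', 'A,A', 'C,G'),
                  loci2 = c('T,T', 'T,A', 'A,T', 'T,G', 'T,A'))

mg <- merge(df1, df2, by = 'sample_ID')

# Columns that came from df1 end in ".x"; derive the matching ".y" names from them
x_cols <- grep('\\.x$', names(mg), value = TRUE)
y_cols <- sub('\\.x$', '.y', x_cols)

# Comparing two data frames element-wise yields a logical matrix
is_match <- mg[x_cols] == mg[y_cols]

# ifelse() keeps the matrix shape, so only the column names need replacing
labels <- ifelse(is_match, 'match', 'no_match')
colnames(labels) <- paste0('match_', sub('\\.x$', '', x_cols))

result <- cbind(mg, labels)
```

Here result keeps the five merged columns and gains one label column per locus.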

Upvotes: 2

jared_mamrot

Reputation: 26695

Unless you need to use a for loop for some reason, one potential solution is to join your dataframes, then pivot_longer() and use a single mutate to compare loci, e.g.

library(tidyverse)

df1 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                  loci1 = c('T,T', 'A,T', 'C,T', 'T,T', 'T,G'),
                  loci2 = c('G,T', 'T,T', 'A,T', 'T,T', 'T,A'))
df2 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                  loci1 = c('T,T', 'A,T', 'C,T', 'A,A', 'C,G'),
                  loci2 = c('T,T', 'T,A', 'A,T', 'T,G', 'T,A'))

df1 %>%
  full_join(df2, by = "sample_ID", suffix = c(".df1", ".df2")) %>%
  pivot_longer(-sample_ID, names_sep = "\\.",
               names_to = c("loci", ".value")) %>%
  mutate(match = ifelse(df1 == df2, "match", "no_match"))
#> # A tibble: 10 × 5
#>    sample_ID loci  df1   df2   match   
#>    <chr>     <chr> <chr> <chr> <chr>   
#>  1 animal1   loci1 T,T   T,T   match   
#>  2 animal1   loci2 G,T   T,T   no_match
#>  3 animal2   loci1 A,T   A,T   match   
#>  4 animal2   loci2 T,T   T,A   no_match
#>  5 animal3   loci1 C,T   C,T   match   
#>  6 animal3   loci2 A,T   A,T   match   
#>  7 animal4   loci1 T,T   A,A   no_match
#>  8 animal4   loci2 T,T   T,G   no_match
#>  9 animal5   loci1 T,G   C,G   no_match
#> 10 animal5   loci2 T,A   T,A   match

Created on 2024-04-03 with reprex v2.1.0
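
If a for loop is needed after all, one possible sketch is below. It assumes the default .x/.y suffixes from inner_join() and that every column other than sample_ID is a locus; the names joined, loci_names and locus are illustrative only. The loop runs after the join rather than inside the pipe, and the column names are built as strings with paste0() and used with `[[`.

```r
library(dplyr)

df1 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                  loci1 = c('T,T', 'A,T', 'C,T', 'T,T', 'T,G'),
                  loci2 = c('G,T', 'T,T', 'A,T', 'T,T', 'T,A'))
df2 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                  loci1 = c('T,T', 'A,T', 'C,T', 'A,A', 'C,G'),
                  loci2 = c('T,T', 'T,A', 'A,T', 'T,G', 'T,A'))

# Join first; shared columns get the default ".x" and ".y" suffixes
joined <- inner_join(df1, df2, by = "sample_ID")

# Assumption: every column except the ID is a locus to compare
loci_names <- setdiff(colnames(df1), "sample_ID")

for (locus in loci_names) {
  x <- joined[[paste0(locus, ".x")]]  # values from df1
  y <- joined[[paste0(locus, ".y")]]  # values from df2
  # Add one new column per locus, e.g. match_loci1
  joined[[paste0("match_", locus)]] <- ifelse(x == y, "match", "no_match")
}
```

After the loop, joined contains match_loci1 and match_loci2 alongside the original .x and .y columns.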

Upvotes: 1

Gregor Thomas

Reputation: 146070

I would pivot your data, then join:

library(dplyr)
library(tidyr)  
df1 |> pivot_longer(-sample_ID, names_to = "loci_i", values_to = "df1_value") |>
  full_join(
    df2 |> pivot_longer(-sample_ID, names_to = "loci_i", values_to = "df2_value"),
    by = c("sample_ID", "loci_i")
  ) |>
  mutate(is_match = df1_value == df2_value)
# # A tibble: 10 × 5
#    sample_ID loci_i df1_value df2_value is_match
#    <chr>     <chr>  <chr>     <chr>     <lgl>   
#  1 animal1   loci1  T,T       T,T       TRUE    
#  2 animal1   loci2  G,T       T,T       FALSE   
#  3 animal2   loci1  A,T       A,T       TRUE    
#  4 animal2   loci2  T,T       T,A       FALSE   
#  5 animal3   loci1  C,T       C,T       TRUE    
#  6 animal3   loci2  A,T       A,T       TRUE    
#  7 animal4   loci1  T,T       A,A       FALSE   
#  8 animal4   loci2  T,T       T,G       FALSE   
#  9 animal5   loci1  T,G       C,G       FALSE   
# 10 animal5   loci2  T,A       T,A       TRUE
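
If one column per locus is wanted in the end, the long result can be pivoted back to wide form. A rough sketch, assuming the pipeline above is stored in an object (called long here purely for illustration) and that the output columns should be named match_loci1, match_loci2, and so on:

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                  loci1 = c('T,T', 'A,T', 'C,T', 'T,T', 'T,G'),
                  loci2 = c('G,T', 'T,T', 'A,T', 'T,T', 'T,A'))
df2 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                  loci1 = c('T,T', 'A,T', 'C,T', 'A,A', 'C,G'),
                  loci2 = c('T,T', 'T,A', 'A,T', 'T,G', 'T,A'))

# Same long-format comparison as above, stored for reuse
long <- df1 |> pivot_longer(-sample_ID, names_to = "loci_i", values_to = "df1_value") |>
  full_join(
    df2 |> pivot_longer(-sample_ID, names_to = "loci_i", values_to = "df2_value"),
    by = c("sample_ID", "loci_i")
  ) |>
  mutate(is_match = df1_value == df2_value)

# Relabel TRUE/FALSE, then spread to one "match_<locus>" column per locus
wide <- long |>
  mutate(label = if_else(is_match, "match", "no_match")) |>
  select(sample_ID, loci_i, label) |>
  pivot_wider(names_from = loci_i, values_from = label, names_prefix = "match_")
```

This gives one row per sample; a locus missing from either data frame would show up as NA rather than "no_match".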

Upvotes: 3
