Reputation: 23
I have two data sets with the same column names and I am trying to compare the data across the columns based upon sample ID. For example:
df1 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
loci1 = c('T,T', 'A,T', 'C,T', 'T,T', 'T,G'),
loci2 = c('G,T', 'T,T', 'A,T', 'T,T', 'T,A'))
df2 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
loci1 = c('T,T', 'A,T', 'C,T', 'A,A', 'C,G'),
loci2 = c('T,T', 'T,A', 'A,T', 'T,G', 'T,A'))
I'd like to write code that checks whether df1$loci1 is the same as df2$loci1 for animal1, and so on, since I have 200 animals and over 70 loci to iterate through. Ideally it would create a new column for each locus indicating "match" or "no match".
I've begun by joining the two data frames and then using mutate to create a new column which outputs if loci1 matches for both data frames:
library(dplyr)
df3 <- df1 %>%
  inner_join(df2, by = 'sample_ID') %>%
  mutate(match_loci1 = c('no_match', 'match')[1 + (loci1.x == loci1.y)])
This works well for one locus, but since I have quite a few to get through, I am hoping for help writing a for loop that iterates over the different loci (now labeled loci$.x and loci$.y after inner_join) and creates a new column for each: "match_loci1", "match_loci2", etc.
I've gotten as far as creating a list of all loci and initiating the for loop:
loci_names <- colnames(df1)
test2 <- df1 %>%
inner_join(df2, by = 'sample_ID') %>%
for (i in loci_list) {
mutate(match$[[i]] = c('no_match', 'match')[1 + [[i]]$.x == [[i]]$.y])
}
but I get this error:
Error: unexpected '[[' in: " for (i in loci_list) { mutate(match$[["
I am not sure how to format the mutate action so that it will iterate through each loci.
Upvotes: 2
Views: 61
Reputation: 73572
Use merge(), then compare the columns selected with grep() for equality.
> mg <- merge(df1, df2, by='sample_ID')
> cbind(mg, match=mg[grep('\\.x$', names(mg))] == mg[grep('\\.y$', names(mg))])
sample_ID loci1.x loci2.x loci1.y loci2.y match.loci1.x match.loci2.x
1 animal1 T,T G,T T,T T,T TRUE FALSE
2 animal2 A,T T,T A,T T,A TRUE FALSE
3 animal3 C,T A,T C,T A,T TRUE TRUE
4 animal4 T,T T,T A,A T,G FALSE FALSE
5 animal5 T,G T,A C,G T,A FALSE TRUE
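The question asked for "match"/"no_match" labels in columns named match_loci1, match_loci2, and so on. A minimal base R sketch of one way to get there from the comparison above is shown below; the names x_cols, y_cols, labels and result are illustrative only, and the sketch assumes every .x column has a .y partner with the same locus name.

```r
df1 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                  loci1 = c('T,T', 'A,T', 'C,T', 'T,T', 'T,G'),
                  loci2 = c('G,T', 'T,T', 'A,T', 'T,T', 'T,A'))
df2 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                  loci1 = c('T,T', 'A,T', 'C,T', 'A,A', 'C,G'),
                  loci2 = c('T,T', 'T,A', 'A,T', 'T,G', 'T,A'))

mg <- merge(df1, df2, by = 'sample_ID')

# Columns that came from df1, and their partners from df2 in the same order
x_cols <- grep('\\.x$', names(mg), value = TRUE)
y_cols <- sub('\\.x$', '.y', x_cols)

# Element-wise comparison gives a logical matrix (rows = animals, cols = loci)
is_match <- mg[x_cols] == mg[y_cols]

# Turn TRUE/FALSE into labels and rename the columns to match_<locus>
labels <- ifelse(is_match, 'match', 'no_match')
colnames(labels) <- paste0('match_', sub('\\.x$', '', x_cols))

result <- cbind(mg, labels)
```

Building y_cols from x_cols, rather than calling grep() twice, keeps each pair aligned even if the two sets of columns end up in a different order.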
Upvotes: 2
Reputation: 26695
Unless you need to use a for loop for some reason, one potential solution is to join your dataframes, then pivot_longer() and use a single mutate to compare loci, e.g.
library(tidyverse)
df1 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
loci1 = c('T,T', 'A,T', 'C,T', 'T,T', 'T,G'),
loci2 = c('G,T', 'T,T', 'A,T', 'T,T', 'T,A'))
df2 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
loci1 = c('T,T', 'A,T', 'C,T', 'A,A', 'C,G'),
loci2 = c('T,T', 'T,A', 'A,T', 'T,G', 'T,A'))
df1 %>%
  full_join(df2, by = "sample_ID", suffix = c(".df1", ".df2")) %>%
  pivot_longer(-sample_ID, names_sep = "\\.",
               names_to = c("loci", ".value")) %>%
  mutate(match = ifelse(df1 == df2, "match", "no_match"))
#> # A tibble: 10 × 5
#> sample_ID loci df1 df2 match
#> <chr> <chr> <chr> <chr> <chr>
#> 1 animal1 loci1 T,T T,T match
#> 2 animal1 loci2 G,T T,T no_match
#> 3 animal2 loci1 A,T A,T match
#> 4 animal2 loci2 T,T T,A no_match
#> 5 animal3 loci1 C,T C,T match
#> 6 animal3 loci2 A,T A,T match
#> 7 animal4 loci1 T,T A,A no_match
#> 8 animal4 loci2 T,T T,G no_match
#> 9 animal5 loci1 T,G C,G no_match
#> 10 animal5 loci2 T,A T,A match
Created on 2024-04-03 with reprex v2.1.0
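If one column per locus is wanted, as in the question, the long result can probably be spread back out with pivot_wider(). The following is a sketch under that assumption; long and wide are illustrative names, and it relies on the locus names containing no "." of their own, since names_sep splits on it.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                  loci1 = c('T,T', 'A,T', 'C,T', 'T,T', 'T,G'),
                  loci2 = c('G,T', 'T,T', 'A,T', 'T,T', 'T,A'))
df2 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                  loci1 = c('T,T', 'A,T', 'C,T', 'A,A', 'C,G'),
                  loci2 = c('T,T', 'T,A', 'A,T', 'T,G', 'T,A'))

# Same long-format comparison as above, stored for reuse
long <- df1 %>%
  full_join(df2, by = "sample_ID", suffix = c(".df1", ".df2")) %>%
  pivot_longer(-sample_ID, names_sep = "\\.",
               names_to = c("loci", ".value")) %>%
  mutate(match = ifelse(df1 == df2, "match", "no_match"))

# One row per animal, one match_<locus> column per locus
wide <- long %>%
  pivot_wider(id_cols = sample_ID,
              names_from = loci,
              values_from = match,
              names_prefix = "match_")
```

Here wide has the columns sample_ID, match_loci1 and match_loci2, and it can be joined back to the original data on sample_ID if the genotypes are needed alongside the labels.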
Upvotes: 1
Reputation: 146070
I would pivot your data, then join:
library(dplyr)
library(tidyr)
df1 |>
  pivot_longer(-sample_ID, names_to = "loci_i", values_to = "df1_value") |>
  full_join(
    df2 |> pivot_longer(-sample_ID, names_to = "loci_i", values_to = "df2_value"),
    by = c("sample_ID", "loci_i")
  ) |>
  mutate(is_match = df1_value == df2_value)
# # A tibble: 10 × 5
# sample_ID loci_i df1_value df2_value is_match
# <chr> <chr> <chr> <chr> <lgl>
# 1 animal1 loci1 T,T T,T TRUE
# 2 animal1 loci2 G,T T,T FALSE
# 3 animal2 loci1 A,T A,T TRUE
# 4 animal2 loci2 T,T T,A FALSE
# 5 animal3 loci1 C,T C,T TRUE
# 6 animal3 loci2 A,T A,T TRUE
# 7 animal4 loci1 T,T A,A FALSE
# 8 animal4 loci2 T,T T,G FALSE
# 9 animal5 loci1 T,G C,G FALSE
# 10 animal5 loci2 T,A T,A TRUE
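With 200 animals and over 70 loci, a per-locus summary may be easier to scan than the full long table. A sketch of one way to do that follows; compared, per_locus and the n_* column names are illustrative, and n_missing counts comparisons that come out as NA, for example when an animal is present in only one data frame after the full_join.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                  loci1 = c('T,T', 'A,T', 'C,T', 'T,T', 'T,G'),
                  loci2 = c('G,T', 'T,T', 'A,T', 'T,T', 'T,A'))
df2 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                  loci1 = c('T,T', 'A,T', 'C,T', 'A,A', 'C,G'),
                  loci2 = c('T,T', 'T,A', 'A,T', 'T,G', 'T,A'))

# Same pivot-then-join comparison as above, stored for reuse
compared <- df1 |>
  pivot_longer(-sample_ID, names_to = "loci_i", values_to = "df1_value") |>
  full_join(
    df2 |> pivot_longer(-sample_ID, names_to = "loci_i", values_to = "df2_value"),
    by = c("sample_ID", "loci_i")
  ) |>
  mutate(is_match = df1_value == df2_value)

# Count matches, mismatches and NA comparisons for each locus
per_locus <- compared |>
  group_by(loci_i) |>
  summarise(
    n_match    = sum(is_match, na.rm = TRUE),
    n_mismatch = sum(!is_match, na.rm = TRUE),
    n_missing  = sum(is.na(is_match))
  )
```

For the example data this gives 3 matches and 2 mismatches for loci1, and 2 matches and 3 mismatches for loci2.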
Upvotes: 3