Reputation: 1122
I would like to compare two columns of genotypes and generate a new boolean column. There is a twist, however: for instance, GG should also count as equal to CC, and AA as equal to TT, and vice versa.
df:
rsid ref sample
rs104211 CC GG
rs104998 AA TT
rs105063 TT AA
rs105076 AA AA
rs105078 TT GG
rs105090 AA GG
rs105162 AC AC
rs105304 AA TT
rs105338 AA GG
rs105490 GG CC
rs105491 AA AA
rs105492 AG AG
rs105705 AC AC
rs105975 AA GG
rs106213 AA AA
rs106396 GG CC
desired output:
rsid ref sample boolean
rs104211 CC GG TRUE
rs104998 AA TT TRUE
rs105063 TT AA TRUE
rs105076 AA AA TRUE
rs105078 TT GG FALSE
rs105090 AA GG FALSE
rs105162 AC AC TRUE
rs105304 AA TT TRUE
rs105338 AA GG FALSE
rs105490 GG CC TRUE
rs105491 AA AA TRUE
rs105492 AG AG TRUE
rs105705 AC AC TRUE
rs105975 AA GG FALSE
rs106213 AA AA TRUE
rs106396 GG CC TRUE
code:
match.boolean <- function(x) {
df <- if (x=="CC" | x=="GG" ) {
print("TRUE")
} else if (x=="AA" | x=="TT") {
print("TRUE")
} else if (x=="AC" | x=="AG") {
print("TRUE")
} else {
print("FALSE")
}
return(df)
}
df$boolean <- lapply(df,function(x) match.boolean(df[,2]==df[,3]))
But it is wrong.
Upvotes: 1
Views: 549
Reputation: 263451
Try this (at least that's what I think the logical expression would be for some of your unstated possibilities):
df$boolean <- with(df, ref == sample |
  (ref %in% c("CC", "GG") & sample %in% c("GG", "CC")) |
  (ref %in% c("TT", "AA") & sample %in% c("TT", "AA"))
)
> df
rsid ref sample boolean
1 rs104211 CC GG TRUE
2 rs104998 AA TT TRUE
3 rs105063 TT AA TRUE
4 rs105076 AA AA TRUE
5 rs105078 TT GG FALSE
6 rs105090 AA GG FALSE
7 rs105162 AC AC TRUE
8 rs105304 AA TT TRUE
9 rs105338 AA GG FALSE
10 rs105490 GG CC TRUE
11 rs105491 AA AA TRUE
12 rs105492 AG AG TRUE
13 rs105705 AC AC TRUE
14 rs105975 AA GG FALSE
15 rs106213 AA AA TRUE
16 rs106396 GG CC TRUE
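For anyone who wants to check the expression without the full table, here is a minimal self-contained sketch. The five rows are copied from the question; storing the columns as character vectors (stringsAsFactors = FALSE) is an assumption on my part, since the question does not say how df was read in.

```r
# Rebuild a few rows of the question's data as character columns (assumed).
df <- data.frame(
  rsid   = c("rs104211", "rs104998", "rs105078", "rs105162", "rs105492"),
  ref    = c("CC", "AA", "TT", "AC", "AG"),
  sample = c("GG", "TT", "GG", "AC", "AG"),
  stringsAsFactors = FALSE
)

# TRUE when the genotypes are identical, or when both fall in the same
# complementary pair (CC/GG or AA/TT).
df$boolean <- with(df,
  ref == sample |
  (ref %in% c("CC", "GG") & sample %in% c("CC", "GG")) |
  (ref %in% c("AA", "TT") & sample %in% c("AA", "TT"))
)

df$boolean
```

Inside with(), sample refers to the column of df rather than to base::sample, so the name clash is harmless here.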
Upvotes: 3
Reputation: 389225
We can create a named comparison_list with all the possible values it can take and then use mapply:
comparison_list <- list(GGCC = c("GG", "CC"), AATT = c("AA", "TT"),
                        ACAG = c("AC", "AG"))
df$boolean <- mapply(function(x, y)
  any(comparison_list[[grep(x, names(comparison_list))]] ==
      comparison_list[[grep(y, names(comparison_list))]]),
  df$ref, df$sample)
df
# rsid ref sample boolean
#1 rs104211 CC GG TRUE
#2 rs104998 AA TT TRUE
#3 rs105063 TT AA TRUE
#4 rs105076 AA AA TRUE
#5 rs105078 TT GG FALSE
#6 rs105090 AA GG FALSE
#7 rs105162 AC AC TRUE
#8 rs105304 AA TT TRUE
#9 rs105338 AA GG FALSE
#10 rs105490 GG CC TRUE
#11 rs105491 AA AA TRUE
#12 rs105492 AG AG TRUE
#13 rs105705 AC AC TRUE
#14 rs105975 AA GG FALSE
#15 rs106213 AA AA TRUE
#16 rs106396 GG CC TRUE
The version above keeps the list short. You could instead create a separate element for every value, which makes the comparison code simpler:
comparison_list <- list(GG = c("GG", "CC"), CC = c("GG", "CC"),
AA = c("AA", "TT"), TT = c("AA", "TT"),
AC = c("AC", "AG"), AG = c("AC", "AG"))
df$boolean <- mapply(function(x, y) any(comparison_list[[x]]==comparison_list[[y]]),
df$ref, df$sample)
Upvotes: 1