Peter Chung

Reputation: 1122

mutate new boolean column by comparing two columns

I would like to compare two genotype columns and generate a new boolean column. There is a twist, however: some genotypes count as equal even though the strings differ. For instance, GG is treated as equal to CC, and AA as equal to TT, and vice versa.

df: 
rsid    ref sample
rs104211    CC  GG
rs104998    AA  TT
rs105063    TT  AA
rs105076    AA  AA
rs105078    TT  GG
rs105090    AA  GG
rs105162    AC  AC
rs105304    AA  TT
rs105338    AA  GG
rs105490    GG  CC
rs105491    AA  AA
rs105492    AG  AG
rs105705    AC  AC
rs105975    AA  GG
rs106213    AA  AA
rs106396    GG  CC

desired output:

rsid    ref sample  boolean
rs104211    CC  GG  TRUE
rs104998    AA  TT  TRUE
rs105063    TT  AA  TRUE
rs105076    AA  AA  TRUE
rs105078    TT  GG  FALSE
rs105090    AA  GG  FALSE
rs105162    AC  AC  TRUE
rs105304    AA  TT  TRUE
rs105338    AA  GG  FALSE
rs105490    GG  CC  TRUE
rs105491    AA  AA  TRUE
rs105492    AG  AG  TRUE
rs105705    AC  AC  TRUE
rs105975    AA  GG  FALSE
rs106213    AA  AA  TRUE
rs106396    GG  CC  TRUE

code:
match.boolean <- function(x) {
df <- if (x=="CC" | x=="GG" ) {
print("TRUE") 
} else if (x=="AA" | x=="TT") {
print("TRUE")
} else if (x=="AC" | x=="AG") {
print("TRUE")
} else {
print("FALSE")
}
return(df)
}

df$boolean <- lapply(df,function(x) match.boolean(df[,2]==df[,3]))

But this does not produce the desired output.

Upvotes: 1

Views: 549

Answers (2)

IRTFM

Reputation: 263451

Try this (my best guess at the logical expression, since some of your equivalence rules are unstated):

df$boolean <- with(df, ref == sample |
                       (ref %in% c("CC", "GG") & sample %in% c("GG", "CC")) |
                       (ref %in% c("TT", "AA") & sample %in% c("TT", "AA")))
> df
       rsid ref sample boolean
1  rs104211  CC     GG    TRUE
2  rs104998  AA     TT    TRUE
3  rs105063  TT     AA    TRUE
4  rs105076  AA     AA    TRUE
5  rs105078  TT     GG   FALSE
6  rs105090  AA     GG   FALSE
7  rs105162  AC     AC    TRUE
8  rs105304  AA     TT    TRUE
9  rs105338  AA     GG   FALSE
10 rs105490  GG     CC    TRUE
11 rs105491  AA     AA    TRUE
12 rs105492  AG     AG    TRUE
13 rs105705  AC     AC    TRUE
14 rs105975  AA     GG   FALSE
15 rs106213  AA     AA    TRUE
16 rs106396  GG     CC    TRUE
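
If more equivalence pairs turn up later, the same idea can be sketched as a lookup: map every genotype to a label for its group, then compare the labels. The names `group` and `canon` are my own, and I am assuming that `ref` and `sample` hold two-letter genotype strings (character or factor) and that only CC/GG and AA/TT are interchangeable, as in the expression above.

```r
# Data from the question
df <- data.frame(
  rsid   = c("rs104211", "rs104998", "rs105063", "rs105076", "rs105078",
             "rs105090", "rs105162", "rs105304", "rs105338", "rs105490",
             "rs105491", "rs105492", "rs105705", "rs105975", "rs106213",
             "rs106396"),
  ref    = c("CC", "AA", "TT", "AA", "TT", "AA", "AC", "AA",
             "AA", "GG", "AA", "AG", "AC", "AA", "AA", "GG"),
  sample = c("GG", "TT", "AA", "AA", "GG", "GG", "AC", "TT",
             "GG", "CC", "AA", "AG", "AC", "GG", "AA", "CC"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: genotypes that share a label count as equal
group <- c(CC = "CG", GG = "CG", AA = "AT", TT = "AT")

# Replace each genotype by its group label; genotypes that are not in
# the lookup (e.g. "AC") keep their own value
canon <- function(x) {
  x <- as.character(x)          # guard against factor columns
  g <- unname(group[x])         # NA where x has no entry in 'group'
  ifelse(is.na(g), x, g)
}

df$boolean <- canon(df$ref) == canon(df$sample)
```

On the question's data this gives the same column as the desired output. Adding another pair only means adding two entries to `group`.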

Upvotes: 3

Ronak Shah

Reputation: 389225

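We can create a named comparison_list in which each element holds one group of equivalent genotypes, and then use mapply to compare the two columns row by row. grep finds the group whose name contains the value: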

comparison_list <- list(GGCC = c("GG", "CC"), AATT = c("AA", "TT"),
                        ACAG = c("AC", "AG"))


df$boolean <- mapply(function(x, y) 
              any(comparison_list[[grep(x, names(comparison_list))]] == 
                  comparison_list[[grep(y, names(comparison_list))]]), 
              df$ref, df$sample)

df
#       rsid ref sample boolean
#1  rs104211  CC     GG    TRUE
#2  rs104998  AA     TT    TRUE
#3  rs105063  TT     AA    TRUE
#4  rs105076  AA     AA    TRUE
#5  rs105078  TT     GG   FALSE
#6  rs105090  AA     GG   FALSE
#7  rs105162  AC     AC    TRUE
#8  rs105304  AA     TT    TRUE
#9  rs105338  AA     GG   FALSE
#10 rs105490  GG     CC    TRUE
#11 rs105491  AA     AA    TRUE
#12 rs105492  AG     AG    TRUE
#13 rs105705  AC     AC    TRUE
#14 rs105975  AA     GG   FALSE
#15 rs106213  AA     AA    TRUE
#16 rs106396  GG     CC    TRUE

The version above keeps the list short. You could instead create a separate element for every value, which makes the comparison code simpler:

comparison_list <- list(GG = c("GG", "CC"), CC = c("GG", "CC"), 
                        AA = c("AA", "TT"), TT = c("AA", "TT"), 
                        AC = c("AC", "AG"), AG = c("AC", "AG"))

df$boolean <- mapply(function(x, y) any(comparison_list[[x]]==comparison_list[[y]]), 
                df$ref, df$sample)
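
If the equivalences in the question come from reading the opposite DNA strand (my assumption, since CC/GG and AA/TT are base complements), the list can be replaced by a vectorized complement using base R's `chartr`. The helper names `complement` and `same_genotype` are hypothetical. Unlike the lists above, this sketch does not treat AC and AG as equivalent, and it does not reorder heterozygous calls (AC versus CA would compare as different).

```r
# Complement each base: A <-> T, C <-> G
complement <- function(x) chartr("ACGT", "TGCA", as.character(x))

# TRUE when the genotypes are identical or exact strand complements
same_genotype <- function(ref, sample) {
  ref    <- as.character(ref)
  sample <- as.character(sample)
  ref == sample | complement(ref) == sample
}

same_genotype(c("CC", "AA", "AC", "TT", "AC"),
              c("GG", "TT", "AC", "GG", "TG"))
# [1]  TRUE  TRUE  TRUE FALSE  TRUE
```

With the question's data frame this would be used as `df$boolean <- same_genotype(df$ref, df$sample)`.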

Upvotes: 1
