ip2018

Reputation: 715

Match values in two columns against the corresponding position in another character column

An example dataframe:

example_df = data.frame(Gene.names = c("A", "B"),
                         Score = c("3.69,2.97,2.57,3.09,2.94",
                                   "3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83"),
                         ResidueAA = c("S", "Y"),
                         ResidueNo = c(3, 3),
                         Sequence = c("MSSYT", "MSSYTRAP") )

I want to check whether the character in the ResidueAA column matches the character found at position ResidueNo of the string in the 'Sequence' column. The output should be another column, say 'Check', with a Yes or No.

This is working code:

example_df$Check <- sapply(seq_len(nrow(example_df)), function(i) {
  d <- example_df[i, ]
  substr(d$Sequence, d$ResidueNo, d$ResidueNo) == d$ResidueAA
})

Is there an easier or more elegant way to do this? Ideally, I want something that works within a dplyr pipe. Related to this, how can I extract the value at the same position (ResidueNo) from the 'Score' column into a new column, say 'Score_1'?

Thanks

Upvotes: 1

Views: 106

Answers (2)

akrun

Reputation: 887231

substr is vectorized over its arguments, so we can use it directly inside mutate

library(dplyr)
example_df  %>%
   mutate(Check = substr(Sequence, ResidueNo, ResidueNo) == ResidueAA)

-output

#  Gene.names                                   Score ResidueAA ResidueNo Sequence Check
#1          A                3.69,2.97,2.57,3.09,2.94         S         3    MSSYT  TRUE
#2          B 3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83         Y         3 MSSYTRAP FALSE
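The question asks for a Yes/No column rather than TRUE/FALSE. A minimal sketch of one way to get that, wrapping the same comparison in dplyr's if_else; the object name `checked` is only illustrative, and the 'Score' column is left out to keep the example short:

```r
library(dplyr)

# Same example data as in the question, without the Score column
example_df <- data.frame(
  Gene.names = c("A", "B"),
  ResidueAA  = c("S", "Y"),
  ResidueNo  = c(3, 3),
  Sequence   = c("MSSYT", "MSSYTRAP"),
  stringsAsFactors = FALSE
)

# Compare the residue at position ResidueNo with ResidueAA and
# recode the logical result as "Yes" / "No"
checked <- example_df %>%
  mutate(Check = if_else(substr(Sequence, ResidueNo, ResidueNo) == ResidueAA,
                         "Yes", "No"))
```

Keeping the logical column is usually more convenient for later filtering; the recoding is only needed if the labels themselves are required in the output.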

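To create a new column with the matching 'Score', use match instead of ==. Where == does an elementwise comparison, match returns the position of the first match in the lookup vector, and that index is then used to extract the 'Score' element. Note that this index is a position in the whole column rather than the current row, so it is only reliable when the first match is the row itself, as it is here

example_df  %>%
    mutate(Score2 =  Score[match(ResidueAA,
         substr(Sequence, ResidueNo, ResidueNo))])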

-output

#  Gene.names                                   Score ResidueAA ResidueNo Sequence
#1          A                3.69,2.97,2.57,3.09,2.94         S         3    MSSYT
#2          B 3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83         Y         3 MSSYTRAP
#                    Score2
#1 3.69,2.97,2.57,3.09,2.94
#2                     <NA>

Update

Based on the comments, we need to extract the element of 'Score' at position 'ResidueNo' when the substring of 'Sequence' at that position is the same as 'ResidueAA'. This can be done by splitting 'Score' with strsplit, which returns a list, extracting its first element ([[1]], valid here because of the rowwise operation), and then indexing with 'ResidueNo' to get the split element at that location

example_df  %>%
  rowwise() %>% 
  mutate(Score2 = if(substr(Sequence, ResidueNo, ResidueNo) == 
    ResidueAA) strsplit(Score, ",")[[1]][ResidueNo] else NA_character_) %>%
  ungroup()

-output

# A tibble: 2 x 6
#  Gene.names Score                                   ResidueAA ResidueNo Sequence Score2
#  <chr>      <chr>                                   <chr>         <dbl> <chr>    <chr> 
#1 A          3.69,2.97,2.57,3.09,2.94                S                 3 MSSYT    2.57  
#2 B          3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83 Y                 3 MSSYTRAP <NA>  

Another option is to expand the data with separate_rows so that each score gets its own row, group by 'Gene.names', use summarise to pick the corresponding 'Score2' element (as in the previous solution), and join the result back to the original dataset

library(tidyr)
example_df %>%
    separate_rows(Score, sep= ",") %>% 
    group_by(Gene.names) %>% 
    summarise(Score2 = if(substr(first(Sequence), first(ResidueNo), first(ResidueNo)) ==
       first(ResidueAA)) Score[first(ResidueNo)] else
         NA_character_, .groups = 'drop') %>% 
    right_join(example_df, by = "Gene.names")
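The per-row extraction can also be sketched without rowwise or reshaping. This is an alternative to the approaches above, not a replacement for them: strsplit already returns one vector per row, and base R's mapply can pick element ResidueNo from each. The names `scored`, `parts` and `pos` are illustrative only:

```r
library(dplyr)

example_df <- data.frame(
  Gene.names = c("A", "B"),
  Score      = c("3.69,2.97,2.57,3.09,2.94",
                 "3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83"),
  ResidueAA  = c("S", "Y"),
  ResidueNo  = c(3, 3),
  Sequence   = c("MSSYT", "MSSYTRAP"),
  stringsAsFactors = FALSE
)

scored <- example_df %>%
  mutate(
    Check  = substr(Sequence, ResidueNo, ResidueNo) == ResidueAA,
    # strsplit() returns a list with one character vector per row;
    # mapply() takes element ResidueNo[i] from the i-th vector
    Score2 = mapply(function(parts, pos) parts[pos],
                    strsplit(Score, ",", fixed = TRUE), ResidueNo),
    # keep the score only where the residue matched
    Score2 = if_else(Check, as.numeric(Score2), NA_real_)
  )
```

Here 'Score2' is converted to numeric, whereas the solutions above keep it as character; drop as.numeric and use NA_character_ if the text form is preferred.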

Upvotes: 1

user12728748

Reputation: 8506

To get an individual score, you need to split the string and return the element at the index given by the position. You could vectorize this, e.g.:

getScore <- Vectorize(function(x, pos) unlist(strsplit(x, ",", TRUE), use.names = FALSE)[pos])
example_df %>% mutate(check=substr(Sequence, ResidueNo, ResidueNo) == ResidueAA, 
                      MyScore=ifelse(check, as.numeric(getScore(Score, ResidueNo)), NA))
#>   Gene.names                                   Score ResidueAA ResidueNo
#> 1          A                3.69,2.97,2.57,3.09,2.94         S         3
#> 2          B 3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83         Y         3
#>   Sequence check MyScore
#> 1    MSSYT  TRUE    2.57
#> 2 MSSYTRAP FALSE      NA
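All of these approaches index 'Score' with a position that refers to 'Sequence', which only makes sense if every sequence has exactly one score per residue. That holds for the example data, but it is an assumption about the real data. A small base R sketch to verify it before extracting scores (the variable names are illustrative):

```r
example_df <- data.frame(
  Score    = c("3.69,2.97,2.57,3.09,2.94",
               "3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83"),
  Sequence = c("MSSYT", "MSSYTRAP"),
  stringsAsFactors = FALSE
)

# Number of comma-separated scores in each row
n_scores <- lengths(strsplit(example_df$Score, ",", fixed = TRUE))

# Number of residues (characters) in each sequence
n_residues <- nchar(example_df$Sequence)

# TRUE for every row where the two counts agree
lengths_agree <- n_scores == n_residues
```

Rows where `lengths_agree` is FALSE would need a closer look, since a position in the sequence would not necessarily line up with the same position in the score list.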

Upvotes: 1
