Reputation: 715
An example dataframe:
example_df = data.frame(Gene.names = c("A", "B"),
                        Score = c("3.69,2.97,2.57,3.09,2.94",
                                  "3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83"),
                        ResidueAA = c("S", "Y"),
                        ResidueNo = c(3, 3),
                        Sequence = c("MSSYT", "MSSYTRAP"))
I want to check whether the character in the ResidueAA column matches the character of the 'Sequence' column at the position given in the ResidueNo column. The output should be another column, say 'Check', with a Yes or No.
This is working code:
example_df$Check <- sapply(seq_len(nrow(example_df)), function(i) {
  d <- example_df[i, ]
  substr(d$Sequence, d$ResidueNo, d$ResidueNo) == d$ResidueAA
})
Is there an easier/elegant way to do this? Ideally, I want something that works within a dplyr pipe. Also, related to this, how can I extract the corresponding value from the 'Score' column into a new column, say, 'Score_1'?
Thanks
Upvotes: 1
Views: 106
Reputation: 887231
We can use substr directly, because it is vectorized over its arguments
library(dplyr)
example_df %>%
mutate(Check = substr(Sequence, ResidueNo, ResidueNo) == ResidueAA)
-output
# Gene.names Score ResidueAA ResidueNo Sequence Check
#1 A 3.69,2.97,2.57,3.09,2.94 S 3 MSSYT TRUE
#2 B 3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83 Y 3 MSSYTRAP FALSE
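The question asks for a Yes/No label rather than TRUE/FALSE. A minimal base R sketch of that last step, assuming the data frame from the question (the intermediate name is_match is only for illustration):

```r
# Data frame copied from the question.
example_df <- data.frame(
  Gene.names = c("A", "B"),
  Score = c("3.69,2.97,2.57,3.09,2.94",
            "3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83"),
  ResidueAA = c("S", "Y"),
  ResidueNo = c(3, 3),
  Sequence = c("MSSYT", "MSSYTRAP"),
  stringsAsFactors = FALSE
)

# substr() is vectorized over x, start and stop, so no explicit loop is needed.
is_match <- with(example_df,
                 substr(Sequence, ResidueNo, ResidueNo) == ResidueAA)

# Map the logical vector to the "Yes"/"No" labels asked for in the question.
example_df$Check <- ifelse(is_match, "Yes", "No")
```

Inside a dplyr pipe the same idea should work by wrapping the comparison in ifelse (or dplyr's if_else) within mutate.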
To create a new column with the matching 'Score', use match to get the corresponding index instead of == (which does an elementwise comparison), and use that index to extract the 'Score' element
example_df %>%
  # match(x, table) returns, for each x, the index of its first occurrence in table
  mutate(Score2 = Score[match(ResidueAA,
                              substr(Sequence, ResidueNo, ResidueNo))])
-output
#Gene.names Score ResidueAA ResidueNo Sequence
#1 A 3.69,2.97,2.57,3.09,2.94 S 3 MSSYT
#2 B 3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83 Y 3 MSSYTRAP
# Score2
#1 3.69,2.97,2.57,3.09,2.94
#2 <NA>
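Note that the index returned by match is a position in the whole column, not in the current row, so with more rows it could point at a different row's 'Score'. A small base R sketch using the values from the example (the variable names are illustrative only):

```r
residue_aa  <- c("S", "Y")
sequences   <- c("MSSYT", "MSSYTRAP")
residue_no  <- c(3, 3)

# Character found at the requested position in each sequence.
at_position <- substr(sequences, residue_no, residue_no)

# For each residue, the index of its FIRST occurrence anywhere in at_position;
# NA when it does not occur at all.
idx <- match(residue_aa, at_position)
```

Here both sequences have "S" at position 3, so "S" is found at index 1 and "Y" is not found. That reproduces the output above, but it is a column-wide lookup rather than a row-by-row comparison.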
Based on the comments, we need to extract the element of 'Score' at position 'ResidueNo' if the substring of 'Sequence' at that position is the same as 'ResidueAA'. This can be done by splitting 'Score' with strsplit into a list, extracting the first element ([[1]], after a rowwise operation), and then using 'ResidueNo' to pick the split value at that position
example_df %>%
  rowwise() %>%
  mutate(Score2 = if (substr(Sequence, ResidueNo, ResidueNo) == ResidueAA)
           strsplit(Score, ",")[[1]][ResidueNo] else NA_character_) %>%
  ungroup()
-output
# A tibble: 2 x 6
# Gene.names Score ResidueAA ResidueNo Sequence Score2
# <chr> <chr> <chr> <dbl> <chr> <chr>
#1 A 3.69,2.97,2.57,3.09,2.94 S 3 MSSYT 2.57
#2 B 3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83 Y 3 MSSYTRAP <NA>
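The same row-by-row lookup can also be written without rowwise. This is a base R sketch under the assumption that 'Score' always holds comma-separated values; the helper names scores, picked and is_match are not from the answer above:

```r
example_df <- data.frame(
  Gene.names = c("A", "B"),
  Score = c("3.69,2.97,2.57,3.09,2.94",
            "3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83"),
  ResidueAA = c("S", "Y"),
  ResidueNo = c(3, 3),
  Sequence = c("MSSYT", "MSSYTRAP"),
  stringsAsFactors = FALSE
)

# strsplit() is vectorized: it returns one character vector per row.
scores <- strsplit(example_df$Score, ",", fixed = TRUE)

# Pick the ResidueNo-th score from each row's vector
# (an out-of-range position yields NA).
picked <- mapply(function(s, i) s[i], scores, example_df$ResidueNo)

# Keep the picked score only where the residue matches the sequence.
is_match <- with(example_df,
                 substr(Sequence, ResidueNo, ResidueNo) == ResidueAA)
example_df$Score2 <- ifelse(is_match, picked, NA_character_)
```

The result is still a character column; wrap it in as.numeric if a numeric score is needed.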
Another option is to use separate_rows to split 'Score' into one row per value, then group by 'Gene.names', summarise to get the corresponding 'Score2' element (similar to the previous solution), and join the result with the original dataset
library(tidyr)
example_df %>%
  separate_rows(Score, sep = ",") %>%
  group_by(Gene.names) %>%
  summarise(Score2 = if (substr(first(Sequence), first(ResidueNo), first(ResidueNo)) ==
                         first(ResidueAA)) Score[first(ResidueNo)] else
                       NA_character_, .groups = 'drop') %>%
  # join back on the grouping key to recover the original columns
  right_join(example_df, by = "Gene.names")
Upvotes: 1
Reputation: 8506
To get an individual score, you would need to split the string and return the element at the index corresponding to the position. You could vectorize this, e.g.:
getScore <- Vectorize(function(x, pos) unlist(strsplit(x, ",", TRUE), use.names = FALSE)[pos])
example_df %>% mutate(check=substr(Sequence, ResidueNo, ResidueNo) == ResidueAA,
MyScore=ifelse(check, as.numeric(getScore(Score, ResidueNo)), NA))
#> Gene.names Score ResidueAA ResidueNo
#> 1 A 3.69,2.97,2.57,3.09,2.94 S 3
#> 2 B 3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83 Y 3
#> Sequence check MyScore
#> 1 MSSYT TRUE 2.57
#> 2 MSSYTRAP FALSE NA
Upvotes: 1