Reputation: 5169
I have the following string in R code.
aas <- "QAWDIIKRIDKK"
I want to find the longest contiguous fragment(s) of that string made up only of characters from the following vector:
hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")
The expected result is:
AW, II
Another example:
QFILVMD -> FILVM
How can I do that in R?
Upvotes: 6
Views: 146
Reputation: 2164
I'd suggest doing it like this. Haven't tested it, but since it uses vectorised operations, it should likely be plenty fast.
library(stringr)
get_longest_fragment <- function(aa, res) {
  aa_vec <- str_split_1(aa, "")
  # pad with FALSE on both sides so runs touching either end of the string are closed
  delta <- diff(c(FALSE, aa_vec %in% res, FALSE))
  # find start and end of TRUE stretches
  starts <- which(delta == 1)
  ends <- which(delta == -1) - 1
  len <- ends - starts
  longest <- len == max(len)
  # index the aa sequence
  str_sub(aa, starts[longest], ends[longest])
}
get_longest_fragment(aas, hydrophobic_res)
#> [1] "AW" "II"
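The boundary-detection idea above can also be sketched in base R. This is only an illustration under a few assumptions: `longest_runs` is a hypothetical name that does not appear in the answer, and the logical vector is padded with FALSE on both sides so that a run touching either end of the string still gets a start and an end.

```r
hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")

# Hypothetical helper: longest run(s) of characters from `res` in `aa`.
longest_runs <- function(aa, res) {
  chars <- strsplit(aa, "")[[1]]
  hit <- chars %in% res
  # +1 marks the first position of a run, -1 the position just after its end
  delta <- diff(c(FALSE, hit, FALSE))
  starts <- which(delta == 1)
  ends <- which(delta == -1) - 1
  # no matching character at all: return an empty result instead of warning
  if (length(starts) == 0) return(character(0))
  len <- ends - starts + 1
  keep <- len == max(len)
  substring(aa, starts[keep], ends[keep])
}

longest_runs("QAWDIIKRIDKK", hydrophobic_res)
longest_runs("KKDFILVM", hydrophobic_res)  # run reaches the end of the string
```

The second call is the case where the padding matters: without a trailing FALSE there would be no -1 step to mark the end of the final run.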
Upvotes: 1
Reputation: 34601
As you mentioned speed is important, consider using stringi, which is optimized for this kind of task. An advantage is that it's easy to vectorize as well:
library(stringi)
find_longest <- function(strng, pat) {
  pats <- if (is.list(pat)) {
    sapply(pat, \(x) stri_join(c("[", x, "]+"), collapse = ""))
  } else {
    stri_join(c("[", pat, "]+"), collapse = "")
  }
  res <- stri_extract_all(strng, regex = pats)
  lapply(res, \(x) {
    nc <- nchar(x)
    x[nc == max(nc)]
  })
}
hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")
aas <- "QAWDIIKRIDKK"
aas2 <- "QFILVMD"
find_longest(c(aas, aas2), hydrophobic_res)
[[1]]
[1] "AW" "II"
[[2]]
[1] "FILVM"
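For comparison, the same character-class idea can be sketched with base R's `gregexpr()` and `regmatches()`, in case adding a dependency is not an option. This is a minimal sketch, not a drop-in replacement: `find_longest_base` is a hypothetical name, it accepts only a single character vector as `pat` (not a list of vectors), and no claim is made about its speed relative to stringi.

```r
# Hypothetical helper: base R analogue of the character-class approach.
find_longest_base <- function(strng, pat) {
  # build a character class such as "[WFILVMCAG]+"
  rx <- paste0("[", paste(pat, collapse = ""), "]+")
  # one character vector of matched runs per input string
  runs <- regmatches(strng, gregexpr(rx, strng))
  lapply(runs, function(x) {
    # a string with no matching character yields an empty result
    if (length(x) == 0) return(character(0))
    x[nchar(x) == max(nchar(x))]
  })
}

hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")
find_longest_base(c("QAWDIIKRIDKK", "QFILVMD"), hydrophobic_res)
```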
Upvotes: 3
Reputation: 79174
Here is an alternative. I find this kind of task easier to solve by thinking in terms of tibbles or data frames:
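library(data.table)  # rleid()
library(dplyr)
library(stringr)     # str_split()
str_split(aas, "")[[1]] %>%
  as_tibble() %>%
  mutate(flag = grepl(paste(hydrophobic_res, collapse = "|"), value)) %>%
  # consecutive characters with the same flag share a group id
  group_by(group = rleid(flag)) %>%
  filter(flag) %>%
  summarise(string = paste(value, collapse = "")) %>%
  # keep only the longest run(s), not every run longer than one character
  filter(nchar(string) == max(nchar(string))) %>%
  pull(string)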
[1] "AW" "II"
Upvotes: 3
Reputation: 887541
One option: split the string, replace the elements that are not in the key vector with NA, paste the characters together by group based on the NA values created, and subset the elements based on the maximum number of characters
f1 <- function(str1, matchvec)
{
  v1 <- strsplit(str1, "")[[1]]
  v1[!v1 %in% matchvec] <- NA
  v2 <- tapply(v1, with(rle(!is.na(v1)),
                        rep(seq_along(values), lengths)),
               FUN = function(x) paste(x[!is.na(x)], collapse = ""))
  unname(v2[nchar(v2) == max(nchar(v2))])
}
-testing
> f1(aas, hydrophobic_res)
[1] "AW" "II"
> f1("QFILVMD", hydrophobic_res)
[1] "FILVM"
A regex-based option: create a pattern matching every character that is not in matchvec, replace those characters with a delimiter using gsub, split on the delimiter, and subset based on the number of characters
f2 <- function(str1, matchvec)
{
  pat <- sprintf("[^%s]", paste(matchvec, collapse = ""))
  v1 <- strsplit(gsub(pat, ",", str1), ",")[[1]]
  v1[nchar(v1) == max(nchar(v1))]
}
-testing
> f2(aas, hydrophobic_res)
[1] "AW" "II"
> f2("QFILVMD", hydrophobic_res)
[1] "FILVM"
Upvotes: 3