Reputation: 5169

How to find the longest contiguous run of characters in a string based on a given vector

I have the following string in R:

aas <- "QAWDIIKRIDKK"

I want to find the longest contiguous fragment(s) of that string made up entirely of characters from the following vector:

hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")

The expected answer is:

AW, II

Another example:

QFILVMD -> FILVM

How can I do that in R?

Upvotes: 6

Answers (4)

Peter H.

Reputation: 2164

I'd suggest doing it like this. I haven't tested it thoroughly, but since it uses vectorised operations it should be plenty fast.

library(stringr)

get_longest_fragment <- function(aa, res) {
  aa_vec <- str_split_1(aa, "")
  # pad with FALSE on both sides so a stretch touching either end of the string is closed
  delta <- diff(c(FALSE, aa_vec %in% res, FALSE))

  # find start and end of TRUE stretches
  starts <- which(delta == 1)
  ends   <- which(delta == -1) - 1

  # ends - starts is one less than the true length, which does not affect the comparison
  len <- ends - starts
  longest <- len == max(len)

  # index the aa sequence
  str_sub(aa, starts[longest], ends[longest])
}

get_longest_fragment(aas, hydrophobic_res)
#> [1] "AW" "II"

Upvotes: 1

lroha

Reputation: 34601

Since you mentioned that speed is important, consider using stringi, which is optimized for this kind of task. Another advantage is that it is easy to vectorize:

library(stringi)

find_longest <- function(strng, pat) {
  pats <- if (is.list(pat)) {
    sapply(pat, \(x) stri_join(c("[", x, "]+"), collapse = ""))
  } else {
    stri_join(c("[", pat, "]+"), collapse = "")
  }
  res <- stri_extract_all(strng, regex = pats)
  lapply(res, \(x) {
    nc <- nchar(x)
    x[nc == max(nc)]
  })
}

hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")
aas <- "QAWDIIKRIDKK"
aas2 <- "QFILVMD"


find_longest(c(aas, aas2), hydrophobic_res)

[[1]]
[1] "AW" "II"

[[2]]
[1] "FILVM"

Upvotes: 3

TarJae

Reputation: 79174

Here is an alternative. I find this kind of task easier to solve by thinking in terms of tibbles or data frames:

library(data.table)  # rleid()
library(dplyr)
library(stringr)     # str_split()

tibble(value = str_split(aas, "")[[1]]) %>% 
  mutate(flag = value %in% hydrophobic_res) %>% 
  group_by(group = rleid(flag)) %>% 
  filter(flag) %>% 
  # one row per matching run
  summarise(string = paste(value, collapse = "")) %>% 
  # keep only the longest run(s), not every run longer than one character
  filter(nchar(string) == max(nchar(string))) %>% 
  pull(string)

[1] "AW" "II"

Upvotes: 3

akrun

Reputation: 887541

One option: split the string, replace the elements that are not in the key vector with NA, paste the characters together within the groups delimited by those NAs, and keep the elements with the maximum number of characters.

f1 <- function(str1, matchvec) {
  v1 <- strsplit(str1, "")[[1]]
  v1[!v1 %in% matchvec] <- NA
  v2 <- tapply(v1,
               with(rle(!is.na(v1)), rep(seq_along(values), lengths)),
               FUN = function(x) paste(x[!is.na(x)], collapse = ""))
  unname(v2[nchar(v2) == max(nchar(v2))])
}

-testing

> f1(aas, hydrophobic_res)
[1] "AW" "II"
> f1("QFILVMD", hydrophobic_res)
[1] "FILVM"
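
To make the grouping step in f1 more concrete, here is a small sketch of what rle() produces for the example string and how the run ids are built from it. The variable names runs and group_id are only illustrative and do not appear in f1.

```r
hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")
v1 <- strsplit("QAWDIIKRIDKK", "")[[1]]

# Run-length encode the match / no-match flags
runs <- rle(v1 %in% hydrophobic_res)
runs$values
#> [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
runs$lengths
#> [1] 1 2 1 2 2 1 3

# One id per run, repeated for every character in that run;
# this is the grouping vector that tapply() receives in f1
group_id <- rep(seq_along(runs$values), runs$lengths)
group_id
#> [1] 1 2 2 3 4 4 5 5 6 7 7 7
```

Groups 2 and 4 are the two matching runs of length 2, which is why "AW" and "II" are returned.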

A regex-based option: build a pattern that matches every character not in matchvec, replace those characters with a delimiter using gsub, split on that delimiter, and subset based on the number of characters.

f2 <- function(str1, matchvec) {
  pat <- sprintf("[^%s]", paste(matchvec, collapse = ""))
  v1 <- strsplit(gsub(pat, ",", str1), ",")[[1]]
  v1[nchar(v1) == max(nchar(v1))]
}

-testing

> f2(aas, hydrophobic_res)
[1] "AW" "II"
> f2("QFILVMD", hydrophobic_res)
[1] "FILVM"
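
A possible variation on the regex idea, as a rough sketch only: instead of replacing the non-matching characters and splitting, extract the matching runs directly with gregexpr() and regmatches(). The name f3 is hypothetical, and the sketch assumes matchvec holds single characters with no special meaning inside a regex character class.

```r
hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")

# f3 is a hypothetical helper, not one of the functions above
f3 <- function(str1, matchvec) {
  # character class matching one or more consecutive allowed characters
  pat <- sprintf("[%s]+", paste(matchvec, collapse = ""))
  runs <- regmatches(str1, gregexpr(pat, str1))[[1]]
  # no allowed character at all: return an empty character vector
  if (length(runs) == 0) return(character(0))
  runs[nchar(runs) == max(nchar(runs))]
}

f3("QAWDIIKRIDKK", hydrophobic_res)
#> [1] "AW" "II"
f3("QFILVMD", hydrophobic_res)
#> [1] "FILVM"
```

This avoids picking a delimiter character, so it does not depend on "," being absent from the input.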

Upvotes: 3
