Reputation: 3584
What would be the best way to abbreviate a string such as ANNNNNNTCCGGG
into AN6TCCG3,
so that every character repeated more than 2 times in a row is replaced by that character followed by the length of the run?
Upvotes: 0
Views: 134
Reputation: 70732
If "performance/speed" is not an issue, here is another approach:
library(gsubfn)
gsubfn('(.)\\1{2,}', ~ paste0(x, nchar(`&`)), 'ANNNNNNTCCGGG')
# [1] "AN6TCCG3"
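If you would rather not depend on gsubfn, the same regex can be applied with base R's regmatches. This is only a sketch: the variable names (x, m) are made up for illustration, and perl = TRUE is my own choice to make sure the backreference \\1 is handled as expected.

```r
x <- "ANNNNNNTCCGGG"

# Find every run of one character repeated 3 or more times
m <- gregexpr("(.)\\1{2,}", x, perl = TRUE)

# Replace each run with its first character followed by the run length
regmatches(x, m) <- lapply(regmatches(x, m), function(runs) {
  paste0(substr(runs, 1, 1), nchar(runs))
})

x
# [1] "AN6TCCG3"
```

Runs of length 2 (the CC here) never match the pattern, so they are left untouched.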
Upvotes: 3
Reputation: 61933
I suspect there is a package in Bioconductor that will do what you want, but it isn't too hard to throw something together in base R:
rle_shortener <- function(strings){
  cvecs <- strsplit(strings, "")
  sapply(cvecs, function(input){
    # Get the run length encoding of the input
    r <- rle(input)
    lens <- r$lengths
    # replace the 1s with blanks so that they
    # don't show up in the resulting string
    lens[lens == 1] <- ""
    # paste the character with the lengths
    paste(r$values, lens, collapse = "", sep = "")
  })
}
> rle_shortener(c("heeeeyo", "ANNNNNNTCCGGG"))
[1] "he4yo" "AN6TC2G3"
Upvotes: 3
Reputation: 330153
There is probably a faster way, but using base R:
r <- rle(unlist(strsplit("ANNNNNNTCCGGG", ""))) # Compute RLE
m <- rbind(r$values, r$lengths) # Combine
paste(ifelse(m == 1, "", m), collapse="")
# [1] "AN6TC2G3"
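Note that an rle-based approach like this abbreviates every run longer than 1, so CC becomes C2, whereas the question asks for AN6TCCG3. A minimal sketch of a variant that only abbreviates runs of 3 or more is below; the function name abbreviate_runs and the min_run argument are hypothetical, not from any package, and strrep needs R 3.3.0 or later.

```r
abbreviate_runs <- function(strings, min_run = 3) {
  vapply(strsplit(strings, ""), function(chars) {
    # Run length encoding of the characters
    r <- rle(chars)
    long <- r$lengths >= min_run
    # Long runs become "<char><count>"; shorter runs are spelled out in full
    pieces <- ifelse(long,
                     paste0(r$values, r$lengths),
                     strrep(r$values, r$lengths))
    paste(pieces, collapse = "")
  }, character(1))
}

abbreviate_runs(c("heeeeyo", "ANNNNNNTCCGGG"))
# returns c("he4yo", "AN6TCCG3")
```

Setting min_run = 2 gives the same output as the code above for this input.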
Upvotes: 6