alki

Reputation: 3584

R: Shortening strings

What is the best way to abbreviate a string such as ANNNNNNTCCGGG to AN6TCCG3, so that any character repeated more than 2 times in a row is replaced by that character followed by the length of the run?

Upvotes: 0

Views: 134

Answers (3)

hwnd

Reputation: 70732

If performance is not a concern, here is another approach using the gsubfn package:

library(gsubfn)
gsubfn('(.)\\1{2,}', ~ paste0(x, nchar(`&`)), 'ANNNNNNTCCGGG')
# [1] "AN6TCCG3"
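
For anyone who would rather not install gsubfn, the same regular expression can probably be applied with base R's gregexpr() and regmatches(). This is only a sketch: the function name abbreviate_runs_regex is made up for illustration, and perl = TRUE is an assumption chosen so the backreference is handled by the PCRE engine.

```r
# Hypothetical helper name; uses base R only, no gsubfn.
abbreviate_runs_regex <- function(x) {
  # Match any character followed by at least two more copies of itself
  m <- gregexpr("(.)\\1{2,}", x, perl = TRUE)
  runs <- regmatches(x, m)
  # Replace each matched run with its first character plus the run length
  regmatches(x, m) <- lapply(runs, function(r) paste0(substr(r, 1, 1), nchar(r)))
  x
}

abbreviate_runs_regex("ANNNNNNTCCGGG")
# [1] "AN6TCCG3"
```

Runs of two (the CC here) do not match the {2,} quantifier, so they are left untouched.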

Upvotes: 3

Dason

Reputation: 61933

I suspect there is a package in Bioconductor that will do what you want, but it isn't too hard to throw something together in base R. Note that this version also shortens runs of two, so CC becomes C2:

rle_shortener <- function(strings){
    cvecs <- strsplit(strings, "")
    sapply(cvecs, function(input){
        # Get the run length encoding of the input
        r <- rle(input)
        lens <- r$lengths
        # replace the 1s with blanks so that they
        # don't show up in the resulting string
        lens[lens == 1] <- ""
        # paste the character with the lengths
        paste(r$values, lens, collapse = "", sep = "")
    })
}


rle_shortener(c("heeeeyo", "ANNNNNNTCCGGG"))
# [1] "he4yo"    "AN6TC2G3"
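
The question only wants runs longer than two to be abbreviated (AN6TCCG3 rather than AN6TC2G3). One way to get that from the same rle() idea is to add a threshold. This is a sketch, not a tested production function: the name rle_abbreviate and the min_run argument are made up for illustration, and strrep() needs R 3.3.0 or later.

```r
# Hypothetical variant of rle_shortener with a minimum run length.
rle_abbreviate <- function(strings, min_run = 3) {
  vapply(strsplit(strings, ""), function(chars) {
    # Run-length encode the characters of one string
    r <- rle(chars)
    long <- r$lengths >= min_run
    # Long runs become "<char><count>"; shorter runs are written out in full
    pieces <- ifelse(long,
                     paste0(r$values, r$lengths),
                     strrep(r$values, r$lengths))
    paste(pieces, collapse = "")
  }, character(1))
}

rle_abbreviate(c("heeeeyo", "ANNNNNNTCCGGG"))
# [1] "he4yo"    "AN6TCCG3"
```

Setting min_run = 2 gives the same output as rle_shortener above.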

Upvotes: 3

zero323

Reputation: 330153

There is probably a faster way, but here is one using base R:

r <- rle(unlist(strsplit("ANNNNNNTCCGGG", ""))) # Compute RLE
m <- rbind(r$values, r$lengths) # Combine into a 2-row character matrix
paste(ifelse(m == 1, "", m), collapse="") # Drop the 1s and paste column by column
# [1] "AN6TC2G3"

Upvotes: 6
