littleworth

Reputation: 5169

How to replace the wild card characters with sampled characters in R

I have the following sequence:

s0 <- "KDRH?THLA???RT?HLAK"

The wildcard character is ?. I want to replace each wildcard with a character sampled from this vector:

AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", 
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

Since s0 has 5 wildcards, I would sample 5 characters from AADict:

set.seed(1)
nof_wildcard <- 5
tolower(sample(AADict, nof_wildcard, TRUE))

Which gives [1] "d" "q" "a" "r" "l"
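
Rather than hard-coding the count, the number of wildcards can be computed from the string itself. A minimal sketch, assuming the wildcard is always the literal character ?; the helper name count_wildcards is hypothetical and not part of the original question:

```r
s0 <- "KDRH?THLA???RT?HLAK"

# Hypothetical helper: count literal "?" characters in each string.
# fixed = TRUE treats "?" as plain text, so no regex escaping is needed.
count_wildcards <- function(x) {
  lengths(regmatches(x, gregexpr("?", x, fixed = TRUE)))
}

nof_wildcard <- count_wildcards(s0)
nof_wildcard
#> [1] 5

# A string without wildcards yields 0.
count_wildcards(c(s0, "KDRH"))
#> [1] 5 0
```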

Hence the expected result is:

     KDRH?THLA???RT?HLAK
     KDRHdTHLAqarRTlHLAK

So each sampled character must be placed exactly where a ? was, but the order of the sampled characters is not important. For example, this answer is also acceptable: KDRHqTHLAdlaRTrHLAK.
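
The acceptance rule above can be written down as a small check. This is only a sketch: check_fill is a hypothetical helper, and it assumes the sampled letters are lower-cased as in the example.

```r
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

# Hypothetical helper: TRUE when `filled` differs from `original` only at
# the "?" positions and every "?" became a lower-case letter from `dict`.
check_fill <- function(original, filled, dict) {
  o <- strsplit(original, "")[[1]]
  f <- strsplit(filled, "")[[1]]
  wild <- o == "?"
  length(o) == length(f) &&
    all(o[!wild] == f[!wild]) &&
    all(f[wild] %in% tolower(dict))
}

s0 <- "KDRH?THLA???RT?HLAK"
check_fill(s0, "KDRHdTHLAqarRTlHLAK", AADict)  # expected result from above
#> [1] TRUE
check_fill(s0, "KDRHqTHLAdlaRTrHLAK", AADict)  # different order, still fine
#> [1] TRUE
check_fill(s0, "KDRHdTHLAqarRTlHLAx", AADict)  # a non-wildcard position changed
#> [1] FALSE
```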

How can I achieve that with R?

Other examples are:

s1 <- "FKDHKHIDVKDRHRTHLAK????RTRHLAK"
s2 <- "FKHIDVKDRHRTRHLAK??????????"

Upvotes: 4

Views: 72

Answers (3)

Rui Barradas

Reputation: 76611

Here is a vectorized function to replace the "?" characters in a vector of strings.

fun <- function(x, dict = AADict) {
  dict <- tolower(dict)
  inx <- gregexpr("\\?", x)
  sapply(seq_along(x), \(j) {
    for(i in inx[[j]]) {
      substr(x[j], i, i) <- sample(dict, 1L)
    }
    x[j]
  })
}

AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", 
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

s0 <- "KDRH?THLA???RT?HLAK"
s1 <- "FKDHKHIDVKDRHRTHLAK????RTRHLAK"
s2 <- "FKHIDVKDRHRTRHLAK??????????"

fun(s0)
#> [1] "KDRHsTHLAwppRTwHLAK"

fun(s1)
#> [1] "FKDHKHIDVKDRHRTHLAKyfqfRTRHLAK"

fun(s2)
#> [1] "FKHIDVKDRHRTRHLAKnsfehqwmkv"

fun(c(s0, s1, s2))
#> [1] "KDRHiTHLAdssRTgHLAK"            "FKDHKHIDVKDRHRTHLAKcdivRTRHLAK"
#> [3] "FKHIDVKDRHRTRHLAKfrpafwpnif"

Created on 2022-10-22 with reprex v2.0.2

Upvotes: 3

jared_mamrot

Reputation: 26685

One approach is to replace the "?" characters 'one at a time' using a loop, e.g.

s0 <- "KDRH?THLA???RT?HLAK"
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", 
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
s0
#> [1] "KDRH?THLA???RT?HLAK"
repeat {
  s0 <- sub("\\?", sample(tolower(AADict), 1), s0)  # fill the first remaining "?"
  if (!grepl("\\?", s0)) break                       # stop once no "?" is left
}
s0
#> [1] "KDRHtTHLAidwRTyHLAK"

s1 <- "FKDHKHIDVKDRHRTHLAK????RTRHLAK"
repeat {
  s1 <- sub("\\?", sample(tolower(AADict), 1), s1)  # fill the first remaining "?"
  if (!grepl("\\?", s1)) break                       # stop once no "?" is left
}
s1
#> [1] "FKDHKHIDVKDRHRTHLAKrstaRTRHLAK"

s2 <- "FKHIDVKDRHRTRHLAK??????????"
repeat {
  s2 <- sub("\\?", sample(tolower(AADict), 1), s2)  # fill the first remaining "?"
  if (!grepl("\\?", s2)) break                       # stop once no "?" is left
}
s2
#> [1] "FKHIDVKDRHRTRHLAKdvcfmheiqn"
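
To avoid repeating the loop for every string, it could be wrapped in a function. The following is a sketch rather than part of the original answer; the name fill_one_at_a_time is hypothetical, and the function assumes a single string as input.

```r
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

# Hypothetical wrapper around the loop above; `s` must be one string.
fill_one_at_a_time <- function(s, dict) {
  # Keep going while at least one literal "?" remains.
  while (grepl("?", s, fixed = TRUE)) {
    # sub() replaces only the first match, so each "?" gets its own draw.
    s <- sub("?", sample(dict, 1), s, fixed = TRUE)
  }
  s
}

s1 <- "FKDHKHIDVKDRHRTHLAK????RTRHLAK"
filled <- fill_one_at_a_time(s1, tolower(AADict))
grepl("?", filled, fixed = TRUE)
#> [1] FALSE
```

For several strings, the function can be applied element-wise, for example with vapply(c(s0, s1, s2), fill_one_at_a_time, character(1), dict = tolower(AADict)).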

Another approach, which also allows sampling without replacement (this needs AADict to have at least as many entries as the string has wildcards):

s0 <- "KDRH?THLA???RT?HLAK"
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", 
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
matches <- gregexpr("\\?", s0)
regmatches(s0, matches) <- lapply(lengths(matches), sample, x = tolower(AADict), replace = FALSE)
s0
#> [1] "KDRHdTHLAlanRTiHLAK"

Created on 2022-10-22 by the reprex package (v2.0.1)
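
The same regmatches() idea appears to extend to a vector of strings. A sketch, assuming every string contains at least one ? and that sampling with replacement is acceptable (as in the question); the variable names x, filled and m are illustrative only:

```r
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

x <- c("KDRH?THLA???RT?HLAK",
       "FKDHKHIDVKDRHRTHLAK????RTRHLAK",
       "FKHIDVKDRHRTRHLAK??????????")

filled <- x
m <- gregexpr("\\?", filled)

# One vector of sampled letters per string, sized to that string's "?" count.
# Assumes every string contains at least one "?".
regmatches(filled, m) <- lapply(
  lengths(m),
  function(k) sample(tolower(AADict), k, replace = TRUE)
)

nchar(filled) == nchar(x)   # string lengths are unchanged
#> [1] TRUE TRUE TRUE
grepl("\\?", filled)        # no wildcard is left
#> [1] FALSE FALSE FALSE
```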

Upvotes: 4

stefan

Reputation: 125268

You could split the string into single characters, which makes it easy to replace the wildcards without needing a loop (which was my first approach):

replace_wc <- function(x, dict) {
  x <- strsplit(x, split = "")[[1]]
  ix <- grepl("\\?", x)
  x[ix] <- sample(dict, sum(ix), replace = TRUE)

  return(paste0(x, collapse = ""))
}

s0 <- "KDRH?THLA???RT?HLAK"
AADict <- c(
  "A", "R", "N", "D", "C", "E", "Q", "G", "H",
  "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
)

set.seed(1)

replace_wc(s0, tolower(AADict))
#> [1] "KDRHdTHLAqarRTlHLAK"
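
replace_wc() as written handles one string at a time, because it only uses the first element of the strsplit() result. One possible way to apply the same idea to several strings is sketched below; replace_wc_vec is a hypothetical name, not part of the original answer.

```r
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

# Hypothetical vectorised variant: apply the split/replace/paste steps
# to each string and return a plain character vector.
replace_wc_vec <- function(x, dict) {
  vapply(x, function(s) {
    chars <- strsplit(s, split = "")[[1]]
    ix <- chars == "?"                                # wildcard positions
    chars[ix] <- sample(dict, sum(ix), replace = TRUE)
    paste0(chars, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

x <- c("KDRH?THLA???RT?HLAK",
       "FKDHKHIDVKDRHRTHLAK????RTRHLAK",
       "FKHIDVKDRHRTRHLAK??????????")

res <- replace_wc_vec(x, tolower(AADict))
nchar(res) == nchar(x)
#> [1] TRUE TRUE TRUE
```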

Upvotes: 3
