rt11

Reputation: 45

replace characters in string based on positions from another variable R

I have the dataframe xo below. For each row, I want to remove from sequence the characters at the positions listed in positions_of_Ns_to_remove. In the example, the resulting new variable should be sequence with all the R's removed. I cannot search for the character itself in this situation; the removal must be based on the position of the character.

p <- data.frame(locus = c("1","2","3"), positions_of_Ns_to_remove = c("12,17,43,100","30,60,61,62",NA))
x <- data.frame(locus = c("1","1","2","3"), sequence = c("xxxxxxxxxxxRxxxxRxxxxxxxxxxxxxxxxxxxxxxxxxRxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxR","xxxxxxxxxxxRxxxxRxxxxxxxxxxxxxxxxxxxxxxxxxRxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxR","xxxxxxxxxxxxxxxxxxxxxxxxxxxxxRxxxxxxxxxxxxxxxxxxxxxxxxxxxxxRRRxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx","xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"))
xo <- merge(x, p, by = "locus", all.x = TRUE)

> xo
  locus                                                                                             sequence positions_of_Ns_to_remove
1     1 xxxxxxxxxxxRxxxxRxxxxxxxxxxxxxxxxxxxxxxxxxRxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxR              12,17,43,100
2     1 xxxxxxxxxxxRxxxxRxxxxxxxxxxxxxxxxxxxxxxxxxRxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxR              12,17,43,100
3     2 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxRxxxxxxxxxxxxxxxxxxxxxxxxxxxxxRRRxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx               30,60,61,62
4     3 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx                      <NA>

This works if there is only 1 row in xo, but not when there are multiple rows. I would like to use tidyverse functions / piping and avoid for loops if possible.

  xo %>% dplyr::mutate(new_sequence = paste(
    replace(unlist(strsplit(sequence, "")),
            as.integer(unlist(strsplit(positions_of_Ns_to_remove, ","))),
            ""),
    collapse = ""))

What I want:

  locus                                                                                             new_sequence positions_of_Ns_to_remove
1     1 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx              12,17,43,100
2     1 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx              12,17,43,100
3     2 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx               30,60,61,62
4     3 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx                      <NA>

Upvotes: 1

Views: 219

Answers (1)

Martin Gal

Reputation: 16988

You could build a custom function and apply it to your data:

library(dplyr)    # for %>%, mutate(), select(), ungroup()
library(stringr)  # for str_sub(), str_split()

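# removes the characters at the positions given in n from the string
remove_pos <- function(string, n) {
  n <- as.integer(n)
  # work from right to left so that a removal does not shift the remaining positions
  n <- sort(n, decreasing = TRUE)
  len <- nchar(string)

  output <- string

  for (i in n) {
    # glue together everything before and everything after position i
    output <- paste0(
      str_sub(output, start = 1L, end = i - 1L),
      str_sub(output, start = i + 1L, end = len)
    )
  }

  return(output)
}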
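
The helper sorts the positions in decreasing order because removing a character shifts every later character one place to the left. A small base R sketch shows the difference; the `cut_at` helper and the toy string are made up for illustration and are not part of the answer's code:

```r
# Hypothetical helper for illustration: drop the single character at position i
cut_at <- function(s, i) {
  paste0(substr(s, 1, i - 1), substr(s, i + 1, nchar(s)))
}

s <- "abRdeRg"  # the R's sit at positions 3 and 6

# Ascending order: after cutting position 3 the second R has moved to position 5,
# so cutting position 6 removes the wrong character
ascending <- Reduce(cut_at, c(3, 6), s)

# Descending order: cutting from the right leaves the earlier positions untouched
descending <- Reduce(cut_at, c(6, 3), s)

ascending   # "abdeR" (one R is still there, the g is gone)
descending  # "abdeg" (both R's removed)
```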

xo %>% 
  # split "12,17,43" into a list column of single positions
  mutate(positions = str_split(positions_of_Ns_to_remove, ",")) %>% 
  # process one row at a time, since remove_pos() handles a single string
  rowwise() %>%
  mutate(
    new_seq = ifelse(!is.na(positions_of_Ns_to_remove), 
                     remove_pos(sequence, unlist(positions)), 
                     sequence)
    ) %>% 
  select(-positions) %>% 
  ungroup()

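which returns (the output below was evidently produced from a five-row variant of the example data, with an extra row whose R sits at position 1)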

# A tibble: 5 x 4
  locus sequence                                    positions_of_Ns_to~ new_seq                                  
  <chr> <chr>                                       <chr>               <chr>                                    
1 1     xxxxxxxxxxxRxxxxRxxxxxxxxxxxxxxxxxxxxxxxxx~ 12,17,43,100        xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx~
2 1     xxxxxxxxxxxRxxxxRxxxxxxxxxxxxxxxxxxxxxxxxx~ 12,17,43,100        xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx~
3 2     xxxxxxxxxxxxxxxxxxxxxxxxxxxxxRxxxxxxxxxxxx~ 30,60,61,62         xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx~
4 3     Rxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx~ 1                   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx~
5 4     xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx~ NA                  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx~
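
If the explicit loop is a concern, one loop-free alternative is to split each sequence into single characters and drop the listed positions in one step with negative indexing. The following is only a sketch in base R, assuming the positions are either NA or a non-empty comma-separated list of 1-based integers; the helper name `drop_positions` and the toy data are made up for illustration:

```r
# Hypothetical helper: remove the characters at the given positions from one string
drop_positions <- function(sequence, positions) {
  if (is.na(positions)) return(sequence)            # nothing to remove
  idx <- as.integer(strsplit(positions, ",")[[1]])  # "2,4" -> c(2L, 4L)
  chars <- strsplit(sequence, "")[[1]]              # one element per character
  paste(chars[-idx], collapse = "")                 # negative index drops them all at once
}

toy <- data.frame(
  sequence  = c("aRbRc", "Rxyz", "abc"),
  positions = c("2,4", "1", NA),
  stringsAsFactors = FALSE
)

# mapply() pairs each sequence with its own positions, one row at a time
toy$new_sequence <- mapply(drop_positions, toy$sequence, toy$positions,
                           USE.NAMES = FALSE)
toy$new_sequence
```

Because all positions are dropped at once, their order no longer matters. The same `mapply()` call should also work inside `mutate()` if a pipeline is preferred.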

Upvotes: 1
