anajacintafernandes
anajacintafernandes

Reputation: 95

do not count certain positions in character when replacing certain character positions (in R)

So, I know my title is a little bit confusing but I was hoping you could help me out here.

I have this data frame df where one column is a RNA sequence alignment. The class of this column is a character.
And then I have these other columns: "Allele_1", "Allele_2" which represent the variants of a single position in the RNA sequence (column 1) and that position is given by column 3 ("Position"). However those positions do not account for the "-", i.e., for instance in row 2 the position of the alleles is U--ACCGU--G----UAUUUGAU--CTAD and NOT U--ACCGU--G----UAUUUGAU--CTAD.

sequence                         Allele_1   Allele_2     Position
UAAGGCUCA----UAGGCAGAU--AUaa     A          U            3
U--ACCGU--G----UAUUUGAU--CTAD    C          G            5
cctaACCGU-UUAGCC---------T       U          C            2

The length of the sequence in column 1 can be variable.

What I want to do is to replace specific letters of the character in specific locations given by "position" and the replacement is given by "Allele_1" and "Allele_2". For instance if the position matches "Allele_2", then I want to replace it by "Allele_2" and vice-versa.

I have tried:

substr(df[,"sequence"], 
  start = df[,"Position"], 
  stop = df[,"Position"]) <- df[,"Allele_1"]

However because my position column does not take into account the "-", it replaces in the wrong place. For instance and back to row 2, it replaces here U--ACCGU--G----UAUUUGAU--CTADinstead of here U--ACCGU--G----UAUUUGAU--CTAD.
Also I haven't figure out how to do "the position matches "Allele_2", then I want to replace it by "Allele_2" and vice-versa" thing.

sessionInfo()

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

Really hoping that you can help me figure this out!!

Cheers!

UPDATE: Sorry, it's supposed to be "if the position matches "Allele_1", then I want to replace it by "Allele_2" and vice-versa" and not "Allele_2", then I want to replace it by "Allele_2".

Upvotes: 2

Views: 87

Answers (2)

alistaire
alistaire

Reputation: 43344

Here are two options. Both are case-sensitive and thus don't replace anything in the third sequence. If you don't want them to be, wrap the appropriate variables in the ifelses in toupper.


strsplit

You can split each sequence into a vector of letters, against which you can then check equality directly. Implemented in mapply, the multivariate version of sapply:

df$new_seq <- mapply(function(seq, a1, a2, pos){
    seq <- strsplit(seq, '')[[1]]    # split into letters
    to_replace <- seq[seq != '-'][pos]    # identify allele to replace
    # assign appropriate replacement to subset
    seq[seq != '-'][pos] <- ifelse(a1 == to_replace, 
                                   a2, ifelse(a2 == to_replace, 
                                              a1, to_replace))
    paste(seq, collapse = '')    # reassemble vector to string
}, df$sequence, df$Allele_1, df$Allele_2, df$Position)

df
##                        sequence Allele_1 Allele_2 Position                       new_seq
## 1  UAAGGCUCA----UAGGCAGAU--AUaa        A        U        3  UAUGGCUCA----UAGGCAGAU--AUaa
## 2 U--ACCGU--G----UAUUUGAU--CTAD        C        G        5 U--ACCCU--G----UAUUUGAU--CTAD
## 3    cctaACCGU-UUAGCC---------T        U        C        2    cctaACCGU-UUAGCC---------T

If you prefer, you can break the operation into multiple steps, assigning the result of each to a variable.


sub (regex)

If you're comfortable with regex, you can assemble expressions to extract the allele in question and then replace it with the appropriate replacement:

df$to_replace <- mapply(function(seq, pos){
    sub(paste0('(?:-*(?:\\w)-*){', pos - 1, '}(\\w).*'), '\\1', seq)
}, df$sequence, df$Position)

df$new_seq <- mapply(function(seq, pos, a1, a2, to_rpl){
    replacement <- ifelse(to_rpl == a1, a2, ifelse(to_rpl == a2, a1, to_rpl))
    sub(paste0('((?:-*(?:\\w)-*){', pos - 1, '})\\w(.*)'), 
        paste0('\\1', replacement, '\\2'), 
        seq)
}, df$sequence, df$Position, df$Allele_1, df$Allele_2, df$to_replace)

df[-5]
##                        sequence Allele_1 Allele_2 Position                       new_seq
## 1  UAAGGCUCA----UAGGCAGAU--AUaa        A        U        3  UAUGGCUCA----UAGGCAGAU--AUaa
## 2 U--ACCGU--G----UAUUUGAU--CTAD        C        G        5 U--ACCCU--G----UAUUUGAU--CTAD
## 3    cctaACCGU-UUAGCC---------T        U        C        2    cctaACCGU-UUAGCC---------T

Upvotes: 1

ds440
ds440

Reputation: 891

Data

I created a first, small dataframe, and a larger more realistic data frame.

df <- data.frame(sequence = c('UAAGGCUCA----UAGGCAGAU--AUaa',
                                  'U--ACCGU--G----UAUUUGAU--CTAD',
                                  'cctaACCGU-UUAGCC---------T'),
                     Allele_1 = c('A', 'C', 'U'),
                     Allele_2 = c('U', 'G', 'C'),
                     Position = c(3, 5, 2))

df_big <- data.frame(sequence = rep(c(paste(rep('UAAGGCUCA----UAGGCAGAU--AUaa', 100), collapse=''),
                                  paste(rep('U--ACCGU--G----UAUUUGAU--CTAD', 100), collapse=''),
                                  paste(rep('cctaACCGU-UUAGCC---------T', 100), collapse='')), 100),
                 Allele_1 = rep(c('A', 'C', 'U'), 100),
                 Allele_2 = rep(c('U', 'G', 'C'), 100),
                 Position = rep(c(1000, 1000, 1000), 100))

Functions

I created a function to return the 'new' position after the multiple alignment (ie not counting -), that works with vectors, as follows; also a faster but less safe function:

find_pos <- function(string, pos) {
  vectorized <- data.frame(string=string, pos=pos)
  sapply(seq_len(nrow(vectorized)), function(i) {
    guess <- vectorized[i,'pos']; oldguess <- 0
    offset <- -1; trynum <- 0
    while(offset != 0 & trynum < 500) {
      offset <- nchar(gsub('[^-]', '', substr(vectorized[i,'string'], oldguess+1, guess)))
      oldguess <- guess
      guess <- oldguess + offset
      trynum <- trynum+1
    }
    return(guess)
  })
}

find_pos_unsafe <- function(string, pos) {
  sapply(seq_along(string), function(i) {
    guess <- pos[i]; oldguess <- 0
    offset <- -1
    while(offset != 0) {
      offset <- nchar(gsub('[^-]', '', substr(string[i], oldguess+1, guess)))
      oldguess <- guess
      guess <- oldguess + offset
    }
    return(guess)
  })
}

The first can be used with different length variables as follows (but incurs an overhead for this flexibility):

> find_pos(string, 1:5)
[1] 1 4 5 6 7

For benchmarking purposes I've wrapped the other code required to get the solution in a function. Again, two forms, one calling the faster matching function and one calling the safer version:

ds440 <- function(df) {
pos <- find_pos(df$sequence, df$Position)
toswap <- ifelse(substr(df$sequence, pos, pos)==df$Allele_1, as.character(df$Allele_2), #if A1, A2
                 ifelse(substr(df$sequence, pos, pos)==df$Allele_2, as.character(df$Allele_1), #If A2, A1
                        substr(df$sequence, pos, pos))) # else keep same

df$replaced <- as.character(df$sequence)
substr(df$replaced, pos, pos) <- as.character(toswap)

df
}

ds440_quick <- function(df) {
  pos <- find_pos_unsafe(df$sequence, df$Position)
  toswap <- ifelse(substr(df$sequence, pos, pos)==df$Allele_1, as.character(df$Allele_2), #if A1, A2
                   ifelse(substr(df$sequence, pos, pos)==df$Allele_2, as.character(df$Allele_1), #If A2, A1
                          substr(df$sequence, pos, pos))) # else keep same

  df$replaced <- as.character(df$sequence)
  substr(df$replaced, pos, pos) <- toswap

  df
}

Other functions adapted from @alistaire

alistaire_split <- function(df) {
  df$new_seq <- mapply(function(seq, a1, a2, pos){
    seq <- strsplit(seq, '')[[1]]    # split into letters
    to_replace <- seq[seq != '-'][pos]    # identify allele to replace
    # assign appropriate replacement to subset
    seq[seq != '-'][pos] <- ifelse(a1 == to_replace, 
                                   a2, ifelse(a2 == to_replace, 
                                              a1, to_replace))
    paste(seq, collapse = '')    # reassemble vector to string
  }, as.character(df$sequence), as.character(df$Allele_1), as.character(df$Allele_2), df$Position)

  df
}

alistaire_sub <- function(df) {
  df$to_replace <- mapply(function(seq, pos){
    sub(paste0('(?:-*(?:\\w)-*){', pos - 1, '}(\\w).*'), '\\1', seq)
  }, df$sequence, df$Position)

  df$new_seq <- mapply(function(seq, pos, a1, a2, to_rpl){
    replacement <- ifelse(to_rpl == a1, a2, ifelse(to_rpl == a2, a1, to_rpl))
    sub(paste0('((?:-*(?:\\w)-*){', pos - 1, '})\\w(.*)'), 
        paste0('\\1', replacement, '\\2'), 
        seq)
  }, df$sequence, df$Position, df$Allele_1, df$Allele_2, df$to_replace)

  df[-5]
}

Results

Note the case sensitivity in the matching above. Use toupper or similar before checking for equality if you don't care about case.

ds440(df)
                       sequence Allele_1 Allele_2 Position                      replaced
1  UAAGGCUCA----UAGGCAGAU--AUaa        A        U        3  UAUGGCUCA----UAGGCAGAU--AUaa
2 U--ACCGU--G----UAUUUGAU--CTAD        C        G        5 U--ACCCU--G----UAUUUGAU--CTAD
3    cctaACCGU-UUAGCC---------T        U        C        2    cctaACCGU-UUAGCC---------T

Benchmarking

Functions were defined above.

library(microbenchmark)

microbenchmark(ds440(df), ds440_quick(df), alistaire_split(df), alistaire_sub(df))
Unit: microseconds
                expr     min       lq     mean   median       uq      max neval cld
           ds440(df) 727.157 747.0065 801.3529 764.1475 781.6400 3658.410   100   c
     ds440_quick(df) 339.784 354.7140 364.4484 364.1440 370.7955  421.400   100  b 
 alistaire_split(df) 138.929 144.9500 150.6806 148.5890 153.8570  261.136   100 a  
   alistaire_sub(df) 815.853 833.6630 855.2414 844.0770 857.2130 1499.370   100   c

microbenchmark(ds440(df_big), ds440_quick(df_big), alistaire_split(df_big))
Unit: milliseconds
                    expr      min       lq     mean   median       uq      max neval cld
           ds440(df_big) 195.4233 199.2827 204.4030 204.9039 208.0195 216.1879   100   c
     ds440_quick(df_big) 136.5985 139.3442 143.7216 145.1585 146.7837 153.9201   100 a  
 alistaire_split(df_big) 138.9117 146.3977 150.8772 148.2299 151.0723 278.7308   100  b 

Clear winner for time in the small examle is @alistaire's split function, however as the df gets bigger, the alistaire_sub function breaks (regex won't handle >999) and my ds440_quick function actually works out slightly faster.

Upvotes: 0

Related Questions