do not count certain positions in character when replacing certain character positions (in R)

Question

So, I know my title is a little bit confusing but I was hoping you could help me out here.

I have this data frame df where one column is an RNA sequence alignment. The class of this column is character.
And then I have these other columns: "Allele_1" and "Allele_2", which represent the variants of a single position in the RNA sequence (column 1); that position is given by the "Position" column. However, those positions do not count the "-" characters. For instance, in row 2 the position of the alleles is the bracketed G in U--ACC[G]U--G----UAUUUGAU--CTAD and NOT the bracketed C in U--A[C]CGU--G----UAUUUGAU--CTAD.

sequence                         Allele_1   Allele_2     Position
UAAGGCUCA----UAGGCAGAU--AUaa     A          U            3
U--ACCGU--G----UAUUUGAU--CTAD    C          G            5
cctaACCGU-UUAGCC---------T       U          C            2

The length of the sequence in column 1 can be variable.

What I want to do is to replace specific letters of the character in specific locations given by "position" and the replacement is given by "Allele_1" and "Allele_2". For instance if the position matches "Allele_2", then I want to replace it by "Allele_2" and vice-versa.

I have tried:

substr(df[,"sequence"], 
  start = df[,"Position"], 
  stop = df[,"Position"]) <- df[,"Allele_1"]

However, because my position column does not take the "-" into account, it replaces in the wrong place. For instance, back to row 2, it replaces here (position in brackets): U--A[C]CGU--G----UAUUUGAU--CTAD instead of here: U--ACC[G]U--G----UAUUUGAU--CTAD.
Also, I haven't figured out how to do the "if the position matches "Allele_2", then I want to replace it by "Allele_2" and vice-versa" part.

sessionInfo()

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

Really hoping that you can help me figure this out!!

Cheers!

UPDATE: Sorry, it's supposed to be "if the position matches "Allele_1", then I want to replace it by "Allele_2" and vice-versa" and not "Allele_2", then I want to replace it by "Allele_2".

alistaire · Accepted Answer

Here are two options. Both are case-sensitive and thus don't replace anything in the third sequence, where the character at the given position is lowercase. If you want case-insensitive matching instead, wrap the values being compared in the ifelse calls in toupper.

`strsplit`

You can split each sequence into a vector of letters, against which you can then check equality directly. Implemented in mapply, the multivariate version of sapply:

df$new_seq <- mapply(function(seq, a1, a2, pos){
    seq <- strsplit(seq, '')[[1]]    # split into letters
    to_replace <- seq[seq != '-'][pos]    # identify allele to replace
    # assign appropriate replacement to subset
    seq[seq != '-'][pos] <- ifelse(a1 == to_replace, 
                                   a2, ifelse(a2 == to_replace, 
                                              a1, to_replace))
    paste(seq, collapse = '')    # reassemble vector to string
}, df$sequence, df$Allele_1, df$Allele_2, df$Position)

df
##                        sequence Allele_1 Allele_2 Position                       new_seq
## 1  UAAGGCUCA----UAGGCAGAU--AUaa        A        U        3  UAUGGCUCA----UAGGCAGAU--AUaa
## 2 U--ACCGU--G----UAUUUGAU--CTAD        C        G        5 U--ACCCU--G----UAUUUGAU--CTAD
## 3    cctaACCGU-UUAGCC---------T        U        C        2    cctaACCGU-UUAGCC---------T

If you prefer, you can break the operation into multiple steps, assigning the result of each to a variable.
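
As a rough sketch of what that multi-step version could look like: the helper below maps the gap-free position to an index in the full string, then swaps the allele. The name `swap_allele`, the `ignore_case` argument and the out-of-range guard are my own additions for illustration and are not part of the code above. With `ignore_case = TRUE` the replacement is inserted exactly as written in the allele column, which is an assumption about what you would want.

```r
# Hypothetical helper (illustration only): swap the allele at the pos-th
# non-gap character of a single sequence.
swap_allele <- function(seq, a1, a2, pos, ignore_case = FALSE) {
    chars <- strsplit(seq, '')[[1]]      # split into single characters
    non_gap <- which(chars != '-')       # string indices of the non-gap characters
    idx <- non_gap[pos]                  # string index of the pos-th non-gap character
    if (is.na(idx)) return(seq)          # assumption: leave sequence alone if pos is out of range
    current <- chars[idx]                # character currently at that position
    norm <- if (ignore_case) toupper else identity
    if (norm(current) == norm(a1)) {
        chars[idx] <- a2                 # Allele_1 found: put Allele_2 there
    } else if (norm(current) == norm(a2)) {
        chars[idx] <- a1                 # Allele_2 found: put Allele_1 there
    }                                    # neither allele: leave unchanged
    paste(chars, collapse = '')          # reassemble the string
}

# the example data from the question
df <- data.frame(
    sequence = c('UAAGGCUCA----UAGGCAGAU--AUaa',
                 'U--ACCGU--G----UAUUUGAU--CTAD',
                 'cctaACCGU-UUAGCC---------T'),
    Allele_1 = c('A', 'C', 'U'),
    Allele_2 = c('U', 'G', 'C'),
    Position = c(3, 5, 2),
    stringsAsFactors = FALSE
)

df$new_seq <- mapply(swap_allele, df$sequence, df$Allele_1, df$Allele_2,
                     df$Position, USE.NAMES = FALSE)

# case-insensitive variant: the third sequence is now changed as well
df$new_seq_ci <- mapply(swap_allele, df$sequence, df$Allele_1, df$Allele_2,
                        df$Position, MoreArgs = list(ignore_case = TRUE),
                        USE.NAMES = FALSE)
```

Here `df$new_seq` should match the output shown above, while `df$new_seq_ci` additionally turns the third sequence into `cUtaACCGU-UUAGCC---------T`.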

`sub` (regex)

If you're comfortable with regex, you can assemble expressions to extract the allele in question and then replace it with the appropriate replacement:

df$to_replace <- mapply(function(seq, pos){
    # backslashes must be doubled inside R string literals
    sub(paste0('(?:-*(?:\\w)-*){', pos - 1, '}(\\w).*'), '\\1', seq)
}, df$sequence, df$Position)

df$new_seq <- mapply(function(seq, pos, a1, a2, to_rpl){
    replacement <- ifelse(to_rpl == a1, a2, ifelse(to_rpl == a2, a1, to_rpl))
    # group 1: the first pos - 1 letters with their gaps; group 2: everything after the target letter
    sub(paste0('((?:-*(?:\\w)-*){', pos - 1, '})\\w(.*)'), 
        paste0('\\1', replacement, '\\2'), 
        seq)
}, df$sequence, df$Position, df$Allele_1, df$Allele_2, df$to_replace)

df[-5]
##                        sequence Allele_1 Allele_2 Position                       new_seq
## 1  UAAGGCUCA----UAGGCAGAU--AUaa        A        U        3  UAUGGCUCA----UAGGCAGAU--AUaa
## 2 U--ACCGU--G----UAUUUGAU--CTAD        C        G        5 U--ACCCU--G----UAUUUGAU--CTAD
## 3    cctaACCGU-UUAGCC---------T        U        C        2    cctaACCGU-UUAGCC---------T
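
If the assembled regex is hard to read, it may help to build it for a single row and look at it. The snippet below is only a sketch for row 2 (position 5, replacing with "C"); the variable names are mine, and `perl = TRUE` is added just to make the regex engine explicit, whereas the code above relies on the default engine.

```r
pos <- 5

# same pattern as the replacement step above, built for one position
pattern <- paste0('((?:-*(?:\\w)-*){', pos - 1, '})\\w(.*)')
writeLines(pattern)
## ((?:-*(?:\w)-*){4})\w(.*)

# group 1 consumes the first 4 letters plus any gaps around them ("U--ACC"),
# the bare \w is the 5th letter (the G), and group 2 is the rest of the string
result <- sub(pattern, '\\1C\\2', 'U--ACCGU--G----UAUUUGAU--CTAD', perl = TRUE)
result
## [1] "U--ACCCU--G----UAUUUGAU--CTAD"
```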
