Reputation: 95
So, I know my title is a little bit confusing but I was hoping you could help me out here.
I have this data frame df
where one column is a RNA sequence alignment. The class of this column is a character.
And then I have these other columns: "Allele_1", "Allele_2" which represent the variants of a single position in the RNA sequence (column 1) and that position is given by column 3 ("Position"). However those positions do not account for the "-", i.e., for instance in row 2 the position of the alleles is U--ACC
GU--G----UAUUUGAU--CTAD
and NOT U--A
CCGU--G----UAUUUGAU--CTAD
.
sequence Allele_1 Allele_2 Position
UAAGGCUCA----UAGGCAGAU--AUaa A U 3
U--ACCGU--G----UAUUUGAU--CTAD C G 5
cctaACCGU-UUAGCC---------T U C 2
The length of the sequence in column 1 can be variable.
What I want to do is to replace specific letters of the character in specific locations given by "position" and the replacement is given by "Allele_1" and "Allele_2". For instance if the position matches "Allele_2", then I want to replace it by "Allele_2" and vice-versa.
I have tried:
substr(df[,"sequence"],
start = df[,"Position"],
stop = df[,"Position"]) <- df[,"Allele_1"]
However because my position column does not take into account the "-", it replaces in the wrong place. For instance and back to row 2, it replaces here U--A
CCGU--G----UAUUUGAU--CTAD
instead of here U--ACC
GU--G----UAUUUGAU--CTAD
.
Also I haven't figure out how to do "the position matches "Allele_2", then I want to replace it by "Allele_2" and vice-versa" thing.
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS
Really hoping that you can help me figure this out!!
Cheers!
UPDATE: Sorry, it's supposed to be "if the position matches "Allele_1", then I want to replace it by "Allele_2" and vice-versa" and not "Allele_2", then I want to replace it by "Allele_2".
Upvotes: 2
Views: 87
Reputation: 43344
Here are two options. Both are case-sensitive and thus don't replace anything in the third sequence. If you don't want them to be, wrap the appropriate variables in the ifelse
s in toupper
.
strsplit
You can split each sequence into a vector of letters, against which you can then check equality directly. Implemented in mapply
, the multivariate version of sapply
:
df$new_seq <- mapply(function(seq, a1, a2, pos){
seq <- strsplit(seq, '')[[1]] # split into letters
to_replace <- seq[seq != '-'][pos] # identify allele to replace
# assign appropriate replacement to subset
seq[seq != '-'][pos] <- ifelse(a1 == to_replace,
a2, ifelse(a2 == to_replace,
a1, to_replace))
paste(seq, collapse = '') # reassemble vector to string
}, df$sequence, df$Allele_1, df$Allele_2, df$Position)
df
## sequence Allele_1 Allele_2 Position new_seq
## 1 UAAGGCUCA----UAGGCAGAU--AUaa A U 3 UAUGGCUCA----UAGGCAGAU--AUaa
## 2 U--ACCGU--G----UAUUUGAU--CTAD C G 5 U--ACCCU--G----UAUUUGAU--CTAD
## 3 cctaACCGU-UUAGCC---------T U C 2 cctaACCGU-UUAGCC---------T
If you prefer, you can break the operation into multiple steps, assigning the result of each to a variable.
sub
(regex)If you're comfortable with regex, you can assemble expressions to extract the allele in question and then replace it with the appropriate replacement:
df$to_replace <- mapply(function(seq, pos){
sub(paste0('(?:-*(?:\\w)-*){', pos - 1, '}(\\w).*'), '\\1', seq)
}, df$sequence, df$Position)
df$new_seq <- mapply(function(seq, pos, a1, a2, to_rpl){
replacement <- ifelse(to_rpl == a1, a2, ifelse(to_rpl == a2, a1, to_rpl))
sub(paste0('((?:-*(?:\\w)-*){', pos - 1, '})\\w(.*)'),
paste0('\\1', replacement, '\\2'),
seq)
}, df$sequence, df$Position, df$Allele_1, df$Allele_2, df$to_replace)
df[-5]
## sequence Allele_1 Allele_2 Position new_seq
## 1 UAAGGCUCA----UAGGCAGAU--AUaa A U 3 UAUGGCUCA----UAGGCAGAU--AUaa
## 2 U--ACCGU--G----UAUUUGAU--CTAD C G 5 U--ACCCU--G----UAUUUGAU--CTAD
## 3 cctaACCGU-UUAGCC---------T U C 2 cctaACCGU-UUAGCC---------T
Upvotes: 1
Reputation: 891
I created a first, small dataframe, and a larger more realistic data frame.
df <- data.frame(sequence = c('UAAGGCUCA----UAGGCAGAU--AUaa',
'U--ACCGU--G----UAUUUGAU--CTAD',
'cctaACCGU-UUAGCC---------T'),
Allele_1 = c('A', 'C', 'U'),
Allele_2 = c('U', 'G', 'C'),
Position = c(3, 5, 2))
df_big <- data.frame(sequence = rep(c(paste(rep('UAAGGCUCA----UAGGCAGAU--AUaa', 100), collapse=''),
paste(rep('U--ACCGU--G----UAUUUGAU--CTAD', 100), collapse=''),
paste(rep('cctaACCGU-UUAGCC---------T', 100), collapse='')), 100),
Allele_1 = rep(c('A', 'C', 'U'), 100),
Allele_2 = rep(c('U', 'G', 'C'), 100),
Position = rep(c(1000, 1000, 1000), 100))
I created a function to return the 'new' position after the multiple alignment (ie not counting -
), that works with vectors, as follows; also a faster but less safe function:
find_pos <- function(string, pos) {
vectorized <- data.frame(string=string, pos=pos)
sapply(seq_len(nrow(vectorized)), function(i) {
guess <- vectorized[i,'pos']; oldguess <- 0
offset <- -1; trynum <- 0
while(offset != 0 & trynum < 500) {
offset <- nchar(gsub('[^-]', '', substr(vectorized[i,'string'], oldguess+1, guess)))
oldguess <- guess
guess <- oldguess + offset
trynum <- trynum+1
}
return(guess)
})
}
find_pos_unsafe <- function(string, pos) {
sapply(seq_along(string), function(i) {
guess <- pos[i]; oldguess <- 0
offset <- -1
while(offset != 0) {
offset <- nchar(gsub('[^-]', '', substr(string[i], oldguess+1, guess)))
oldguess <- guess
guess <- oldguess + offset
}
return(guess)
})
}
The first can be used with different length variables as follows (but incurs an overhead for this flexibility):
> find_pos(string, 1:5)
[1] 1 4 5 6 7
For benchmarking purposes I've wrapped the other code required to get the solution in a function. Again, two forms, one calling the faster matching function and one calling the safer version:
ds440 <- function(df) {
pos <- find_pos(df$sequence, df$Position)
toswap <- ifelse(substr(df$sequence, pos, pos)==df$Allele_1, as.character(df$Allele_2), #if A1, A2
ifelse(substr(df$sequence, pos, pos)==df$Allele_2, as.character(df$Allele_1), #If A2, A1
substr(df$sequence, pos, pos))) # else keep same
df$replaced <- as.character(df$sequence)
substr(df$replaced, pos, pos) <- as.character(toswap)
df
}
ds440_quick <- function(df) {
pos <- find_pos_unsafe(df$sequence, df$Position)
toswap <- ifelse(substr(df$sequence, pos, pos)==df$Allele_1, as.character(df$Allele_2), #if A1, A2
ifelse(substr(df$sequence, pos, pos)==df$Allele_2, as.character(df$Allele_1), #If A2, A1
substr(df$sequence, pos, pos))) # else keep same
df$replaced <- as.character(df$sequence)
substr(df$replaced, pos, pos) <- toswap
df
}
Other functions adapted from @alistaire
alistaire_split <- function(df) {
df$new_seq <- mapply(function(seq, a1, a2, pos){
seq <- strsplit(seq, '')[[1]] # split into letters
to_replace <- seq[seq != '-'][pos] # identify allele to replace
# assign appropriate replacement to subset
seq[seq != '-'][pos] <- ifelse(a1 == to_replace,
a2, ifelse(a2 == to_replace,
a1, to_replace))
paste(seq, collapse = '') # reassemble vector to string
}, as.character(df$sequence), as.character(df$Allele_1), as.character(df$Allele_2), df$Position)
df
}
alistaire_sub <- function(df) {
df$to_replace <- mapply(function(seq, pos){
sub(paste0('(?:-*(?:\\w)-*){', pos - 1, '}(\\w).*'), '\\1', seq)
}, df$sequence, df$Position)
df$new_seq <- mapply(function(seq, pos, a1, a2, to_rpl){
replacement <- ifelse(to_rpl == a1, a2, ifelse(to_rpl == a2, a1, to_rpl))
sub(paste0('((?:-*(?:\\w)-*){', pos - 1, '})\\w(.*)'),
paste0('\\1', replacement, '\\2'),
seq)
}, df$sequence, df$Position, df$Allele_1, df$Allele_2, df$to_replace)
df[-5]
}
Note the case sensitivity in the matching above. Use toupper
or similar before checking for equality if you don't care about case.
ds440(df)
sequence Allele_1 Allele_2 Position replaced
1 UAAGGCUCA----UAGGCAGAU--AUaa A U 3 UAUGGCUCA----UAGGCAGAU--AUaa
2 U--ACCGU--G----UAUUUGAU--CTAD C G 5 U--ACCCU--G----UAUUUGAU--CTAD
3 cctaACCGU-UUAGCC---------T U C 2 cctaACCGU-UUAGCC---------T
Functions were defined above.
library(microbenchmark)
microbenchmark(ds440(df), ds440_quick(df), alistaire_split(df), alistaire_sub(df))
Unit: microseconds
expr min lq mean median uq max neval cld
ds440(df) 727.157 747.0065 801.3529 764.1475 781.6400 3658.410 100 c
ds440_quick(df) 339.784 354.7140 364.4484 364.1440 370.7955 421.400 100 b
alistaire_split(df) 138.929 144.9500 150.6806 148.5890 153.8570 261.136 100 a
alistaire_sub(df) 815.853 833.6630 855.2414 844.0770 857.2130 1499.370 100 c
microbenchmark(ds440(df_big), ds440_quick(df_big), alistaire_split(df_big))
Unit: milliseconds
expr min lq mean median uq max neval cld
ds440(df_big) 195.4233 199.2827 204.4030 204.9039 208.0195 216.1879 100 c
ds440_quick(df_big) 136.5985 139.3442 143.7216 145.1585 146.7837 153.9201 100 a
alistaire_split(df_big) 138.9117 146.3977 150.8772 148.2299 151.0723 278.7308 100 b
Clear winner for time in the small examle is @alistaire
's split function, however as the df gets bigger, the alistaire_sub
function breaks (regex won't handle >999) and my ds440_quick
function actually works out slightly faster.
Upvotes: 0