Reputation: 1112
I have a data frame like this:
df = read.table(text="REF Alt S00001 S00002 S00003 S00004 S00005
TAAGAAG TAAG TAAGAAG/TAAGAAG TAAGAAG/TAAG TAAG/TAAG TAAGAAG/TAAGAAG TAAGAAG/TAAGAAG
T TG T/T -/- TG/TG T/T T/T
CAAAA CAAA CAAAA/CAAAA CAAAA/CAAA CAAAA/CAAAA -/- CAAAA/CAAAA
TTGT TTGTGT TTGT/TTGT TTGT/TTGT TTGT/TTGT TTGTGT/TTGTGT TTGT/TTGTGT
GTTT GTTTTT GTTT/GTTTTT GTTT/GTTT GTTT/GTTT GTTT/GTTT GTTTTT/GTTTTT", header=T, stringsAsFactors=F)
I would like to replace each allele on either side of "/" with "D" or "I", depending on the string lengths in columns "REF" and "Alt". An allele that matches the longer of the two is replaced by "I"; one that matches the shorter is replaced by "D". "-" should be left unchanged. The expected result is:
REF Alt S00001 S00002 S00003 S00004 S00005
TAAGAAG TAAG I/I I/D D/D I/I I/I
T TG D/D -/- I/I D/D D/D
CAAAA CAAA I/I I/D I/I -/- I/I
TTGT TTGTGT D/D D/D D/D I/I D/I
GTTT GTTTTT D/I D/D D/D D/D I/I
Upvotes: 4
Views: 297
Reputation: 461
You could create a map from every combination of "REF" and "Alt" values to the corresponding combination of "I" and "D":
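# For each row, the longer of REF/Alt maps to 'I' and the shorter to 'D'
ref_longer <- nchar(df$REF) > nchar(df$Alt)
refalt <- data.frame(
  from = c(df$REF, df$Alt),
  to = c(ifelse(ref_longer, 'I', 'D'), ifelse(ref_longer, 'D', 'I')),
  stringsAsFactors = FALSE)
# '-' maps to itself
refalt <- rbind(refalt, c('-', '-'))
# All pairwise combinations, e.g. 'TAAGAAG/TAAG' -> 'I/D'
# Note: the map is shared across rows, so each allele string must have a single role
from <- expand.grid(refalt$from, refalt$from)
to <- expand.grid(refalt$to, refalt$to)
map <- paste(to[,1], to[,2], sep='/')
names(map) <- paste(from[,1], from[,2], sep='/')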
Then, you could use the map for each column:
for (name in paste0('S0000', seq(5))) {
  df[[name]] <- map[df[[name]]]
}
Upvotes: 0
Reputation: 17611
Here is one approach. I used the stringi package because it does well with vectors of patterns and vectors of strings to search in.
First establish which string is shorter, which is longer:
short <- ifelse(nchar(df$Alt) > nchar(df$REF), df$REF, df$Alt)
long <- ifelse(nchar(df$REF) > nchar(df$Alt), df$REF, df$Alt)
Use those and loop over your columns, assigning the appropriate replacement. Replace the long patterns first, because the short string can occur inside the long one (for example, "TAAG" inside "TAAGAAG") and would otherwise match both:
library(stringi)
df[,!(names(df) %in% c("REF", "Alt"))] <- # assign into original df
  lapply(1:(ncol(df) - 2), # - 2 because there are two columns we don't use
         function(ii) stri_replace_all_fixed(df[ ,ii + 2], long, "I")) # + 2 to skip first 2 columns
df[,!(names(df) %in% c("REF", "Alt"))] <-
  lapply(1:(ncol(df) - 2),
         function(ii) stri_replace_all_fixed(df[ ,ii + 2], short, "D"))
# REF Alt S00001 S00002 S00003 S00004 S00005
#1 TAAGAAG TAAG I/I I/D D/D I/I I/I
#2 T TG D/D -/- I/I D/D D/D
#3 CAAAA CAAA I/I I/D I/I -/- I/I
#4 TTGT TTGTGT D/D D/D D/D I/I D/I
#5 GTTT GTTTTT D/I D/D D/D D/D I/I
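To see why the order of the two passes matters, here is a minimal sketch on the first row only. It uses base R's gsub(fixed = TRUE) instead of stri_replace_all_fixed so it runs without extra packages; that swap is an assumption made for illustration, and the variable names long_first and short_first are hypothetical.

```r
# Row 1 of the example data: the short allele is a prefix of the long one
long  <- "TAAGAAG"
short <- "TAAG"
x <- c("TAAGAAG/TAAGAAG", "TAAGAAG/TAAG", "TAAG/TAAG")

# Long pattern first, then short: each allele is replaced as a whole
long_first <- gsub(short, "D", gsub(long, "I", x, fixed = TRUE), fixed = TRUE)

# Short pattern first: "TAAG" is consumed inside "TAAGAAG", leaving a stray "AAG",
# so the long pattern no longer matches anything
short_first <- gsub(long, "I", gsub(short, "D", x, fixed = TRUE), fixed = TRUE)

print(long_first)   # "I/I" "I/D" "D/D"
print(short_first)  # "DAAG/DAAG" "DAAG/D" "D/D"
```

Only the long-first order reproduces the expected S00001 to S00003 values for that row.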
Upvotes: 4