user3354212

Reputation: 1112

Replace part of a character string in a data frame by condition in R

I have a data frame like this:

df = read.table(text="REF   Alt S00001  S00002  S00003  S00004  S00005
 TAAGAAG    TAAG    TAAGAAG/TAAGAAG TAAGAAG/TAAG    TAAG/TAAG   TAAGAAG/TAAGAAG TAAGAAG/TAAGAAG
 T  TG  T/T -/- TG/TG   T/T T/T
 CAAAA  CAAA    CAAAA/CAAAA CAAAA/CAAA  CAAAA/CAAAA -/- CAAAA/CAAAA
 TTGT   TTGTGT  TTGT/TTGT   TTGT/TTGT   TTGT/TTGT   TTGTGT/TTGTGT   TTGT/TTGTGT
 GTTT   GTTTTT  GTTT/GTTTTT GTTT/GTTT   GTTT/GTTT   GTTT/GTTT   GTTTTT/GTTTTT", header=T, stringsAsFactors=F)

I would like to replace each element separated by "/" with either "I" or "D", depending on the lengths of the strings in columns "REF" and "Alt": an element that matches the longer of the two strings in its row is replaced by "I", and one that matches the shorter is replaced by "D". Elements that are "-" stay unchanged. The expected result is:

REF Alt S00001  S00002  S00003  S00004  S00005
TAAGAAG TAAG    I/I I/D D/D I/I I/I
T   TG  D/D -/- I/I D/D D/D
CAAAA   CAAA    I/I I/D I/I -/- I/I
TTGT    TTGTGT  D/D D/D D/D I/I D/I
GTTT    GTTTTT  D/I D/D D/D D/D I/I

Upvotes: 4

Views: 297

Answers (2)

cr3

Reputation: 461

You could create a map with all the combinations of REF and Alt with the corresponding combinations of I and D:

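# Per row, the longer of REF/Alt maps to "I" and the shorter to "D"
# (the map is global, so no allele may be long in one row and short in another)
ref_longer <- nchar(df$REF) > nchar(df$Alt)
refalt <- data.frame(
    from=c(ifelse(ref_longer, df$REF, df$Alt), ifelse(ref_longer, df$Alt, df$REF)),
    to=rep(c('I', 'D'), each=nrow(df)),
    stringsAsFactors=FALSE)
refalt <- rbind(refalt, c('-', '-'))  # "-" stays "-"
# All ordered pairs of alleles, and the matching pairs of codes
from <- expand.grid(refalt$from, refalt$from, stringsAsFactors=FALSE)
to <- expand.grid(refalt$to, refalt$to, stringsAsFactors=FALSE)
map <- paste(to[,1], to[,2], sep='/')
names(map) <- paste(from[,1], from[,2], sep='/')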

Then, you could use the map for each column:

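# Every column except REF and Alt is a sample column
for (name in setdiff(names(df), c('REF', 'Alt'))) {
    df[[name]] <- unname(map[df[[name]]])  # unname() drops the lookup keys
}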
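
The loop above relies on indexing a named character vector by name. A minimal sketch of that mechanism on a handful of made-up entries (the names `lookup`, `genotypes`, `recoded` and `absent` are illustrative and not part of the answer):

```r
# Named character vector: the names are the keys, the elements are the values
lookup <- c("TG/TG" = "I/I", "T/T" = "D/D", "T/TG" = "D/I",
            "TG/T" = "I/D", "-/-" = "-/-")

genotypes <- c("T/T", "-/-", "TG/TG", "T/TG")

# Indexing by a character vector returns one value per key, in order;
# unname() drops the keys that would otherwise be kept as names
recoded <- unname(lookup[genotypes])
print(recoded)

# A key that is not in the lookup yields NA rather than an error
absent <- unname(lookup["A/A"])
print(is.na(absent))
```

Because unmapped keys come back as NA, it may be worth checking the recoded columns with anyNA() before relying on them.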

Upvotes: 0

Jota

Reputation: 17611

Here is one approach. I used the stringi package because its functions are vectorized over both the strings being searched and the patterns.

First, establish which string is shorter and which is longer in each row:

short <- ifelse(nchar(df$Alt) > nchar(df$REF), df$REF, df$Alt)
long <- ifelse(nchar(df$REF) > nchar(df$Alt), df$REF, df$Alt)
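
As a quick check of what these two vectors hold, here is the same logic applied to the first two rows of the question's data, typed in by hand so the snippet runs on its own:

```r
# First two REF/Alt pairs from the question
REF <- c("TAAGAAG", "T")
Alt <- c("TAAG", "TG")

# Element-wise: pick whichever of REF/Alt is shorter (or longer) in each row
short <- ifelse(nchar(Alt) > nchar(REF), REF, Alt)
long  <- ifelse(nchar(REF) > nchar(Alt), REF, Alt)

print(short)  # "TAAG" "T"
print(long)   # "TAAGAAG" "TG"
```

Note that if REF and Alt had the same length in some row, both expressions would return Alt for that row; the question's data contains no such row.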

Use those and loop over your columns, assigning a replacement as appropriate. Replace against the long patterns first to avoid issues with strings that match both the short and long patterns:

library(stringi)

df[,!(names(df) %in% c("REF", "Alt"))] <- # assign into original df
  lapply(1:(ncol(df) - 2), # - 2 because there are two columns we don't use
    function(ii) stri_replace_all_fixed(df[ ,ii + 2], long, "I")) # + 2 to skip first 2 columns

df[,!(names(df) %in% c("REF", "Alt"))] <- 
  lapply(1:(ncol(df) - 2),
    function(ii) stri_replace_all_fixed(df[ ,ii + 2], short, "D"))

#      REF    Alt S00001 S00002 S00003 S00004 S00005
#1 TAAGAAG   TAAG    I/I    I/D    D/D    I/I    I/I
#2       T     TG    D/D    -/-    I/I    D/D    D/D
#3   CAAAA   CAAA    I/I    I/D    I/I    -/-    I/I
#4    TTGT TTGTGT    D/D    D/D    D/D    I/I    D/I
#5    GTTT GTTTTT    D/I    D/D    D/D    D/D    I/I
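
If you would rather not depend on stringi, a base R variant is also possible. The following is only a sketch, not the method above: it splits each genotype on "/" and compares whole alleles instead of doing substring replacement, so the order of replacement no longer matters. The helper name `recode_column` is made up for this example, and the data frame is a three-row excerpt of the question's data.

```r
df <- data.frame(
  REF = c("TAAGAAG", "T", "TTGT"),
  Alt = c("TAAG", "TG", "TTGTGT"),
  S00001 = c("TAAGAAG/TAAGAAG", "T/T", "TTGT/TTGT"),
  S00002 = c("TAAGAAG/TAAG", "-/-", "TTGT/TTGTGT"),
  stringsAsFactors = FALSE
)

# Longer allele of each row
long <- ifelse(nchar(df$REF) > nchar(df$Alt), df$REF, df$Alt)

# Hypothetical helper: recode one sample column, one genotype per row
recode_column <- function(genotypes, long) {
  mapply(function(gt, lg) {
    alleles <- strsplit(gt, "/", fixed = TRUE)[[1]]
    # "-" is kept, the long allele becomes "I", anything else becomes "D"
    codes <- ifelse(alleles == "-", "-", ifelse(alleles == lg, "I", "D"))
    paste(codes, collapse = "/")
  }, genotypes, long, USE.NAMES = FALSE)
}

# Apply to every column except REF and Alt
sample_cols <- setdiff(names(df), c("REF", "Alt"))
df[sample_cols] <- lapply(df[sample_cols], recode_column, long = long)
print(df)
```

One caveat of this sketch: any allele that is neither "-" nor the long string is labelled "D", including values that match neither REF nor Alt.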

Upvotes: 4
