Get all possible permutations of a DNA sequence with an ambiguous base in R

Question

Let's say I have a DNA sequence with an ambiguous base, N, where N can represent any base (it's a flex position).

dna.seq <- 'ATGCN'

I want a vector of every possible DNA sequence this could represent. It would look like:

c('ATGCA','ATGCT','ATGCG','ATGCC')

The solution also needs to handle DNA sequences with multiple N characters, which produce many more potential sequences (4^k for k N characters).

MichaelChirico · Accepted Answer

CJ from data.table can help you here:

library(data.table)
dna.seq <- 'ATGCN'

# split into components
l = tstrsplit(dna.seq, '', fixed = TRUE)

# replace N with all possibilities
all_bases = c('A', 'T', 'C', 'G')
l = lapply(l, function(x) if (x == 'N') all_bases else x)

# use CJ and reduce to strings:
Reduce(paste0, do.call(CJ, l))
# [1] "ATGCA" "ATGCC" "ATGCG" "ATGCT"

The same approach handles multiple N characters:

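dna.seq <- 'ATNCN'
# rebuild l from the new sequence before expanding
l = tstrsplit(dna.seq, '', fixed = TRUE)
l = lapply(l, function(x) if (x == 'N') all_bases else x)
Reduce(paste0, do.call(CJ, l))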
#  [1] "ATACA" "ATACC" "ATACG" "ATACT" "ATCCA" "ATCCC" "ATCCG" "ATCCT"
#  [9] "ATGCA" "ATGCC" "ATGCG" "ATGCT" "ATTCA" "ATTCC" "ATTCG" "ATTCT"

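If you want to drop the data.table dependency, you can replace tstrsplit(dna.seq, '') with strsplit(dna.seq, '')[[1]] and CJ with expand.grid; you'll just be sacrificing some computational speed. Note that expand.grid varies the first column fastest and does not sort, so the results can come back in a different order than with CJ.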
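The base R route mentioned above could look roughly like the sketch below. The function name expand_ambiguous is made up for illustration and is not part of the original answer. It takes the characters with strsplit(...)[[1]], and the final sort() is only there so the output order matches the CJ version.

```r
# Hypothetical helper (name is an assumption, not from the answer):
# expand every N in a DNA string into all four bases using base R only.
expand_ambiguous <- function(dna.seq, all_bases = c('A', 'T', 'C', 'G')) {
  # split the string into single characters
  chars <- strsplit(dna.seq, '')[[1]]

  # replace each N with the full set of bases, keep other characters as-is
  l <- lapply(chars, function(x) if (x == 'N') all_bases else x)

  # one row per combination, one column per position
  grid <- expand.grid(l, stringsAsFactors = FALSE)

  # paste the columns together row by row; sort to mimic CJ's ordering
  sort(do.call(paste0, grid))
}

expand_ambiguous('ATGCN')
expand_ambiguous('ATNCN')
```

Wrapping the steps in a function also avoids reusing a stale l when dna.seq changes.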
