Reputation: 10199
I have a list of strings (DNA sequence) including A,T,C,G. I want to find all matches and insert into table whose columns are all possible combination of those DNA alphabet (4^k; "k" is length of each match - K-mer - and must be specified by user) and rows represent number of matches in sequence in a list.
Lets say my list includes 5 members:
DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA")
I want set k=2
(2-mer) so 4^2=16
combination are available including AA,AT,AC,AG,TA,TT,...
So my table will have 5 rows
and 16 columns
. I want to count number of matches between my k-mers and list members.
My desired result: df:
lstMemb AA AT AC AG TA TT TC ...
1 2 1 1 0 0 3 0
2 ...
3
4
5
Could you help me implement this in R?
Upvotes: 6
Views: 5228
Reputation:
Another way to do this:
DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA","ACACACACACCA")
len <- 4
stri_sub_fun <- function(x) table(stri_sub(x,1:(stri_length(x)-len+1),length = len))
sapply(DNAlst, stri_sub_fun)
[[1]]
AAAC AACT ACTG ATTT CAAA CTGA GATT TGAT TTTT
1 1 1 1 1 1 1 1 1
[[2]]
AAAA AAAG AAAT AAGT AATA ACCG AGTA ATAC ATGA GAAA GATG GTAA TAAA TACC TGAA
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[[3]]
ATGC ATTA TATG TTAT
1 1 1 1
[[4]]
TGGA
1
[[5]]
ATCA CATC CGCA CGCG GCAT GCGC TCAA
1 1 1 1 1 1 1
[[6]]
ACAC ACCA CACA CACC
4 1 3 1
Upvotes: 2
Reputation: 16080
If you are looking for speed the obvious solution is stringi
package.
There is stri_count_fixed
function for counting patterns.
And now, check the code and benchmark!
DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA")
dna <- stri_paste(rep(c("A","C","G","T"),each=4),c("A","C","G","T"))
result <- t(sapply(DNAlst, stri_count_fixed,pattern=dna,overlap=TRUE))
colnames(result) <- dna
result
AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
[1,] 2 1 0 1 1 0 0 1 1 0 0 0 0 0 1 3
[2,] 5 1 1 2 0 1 1 0 2 0 0 1 2 0 1 0
[3,] 0 0 0 2 0 0 0 0 0 1 0 0 1 0 1 1
[4,] 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0
[5,] 1 0 0 1 2 0 2 0 0 2 0 0 0 1 0 0
fstri <- function(x){
t(sapply(x, stri_count_fixed,dna,T))
}
fbio <- function(x){
t(sapply(x, function(x){x1 <- DNAString(x); oligonucleotideFrequency(x1,2)}))
}
all(fstri(DNAlst)==fbio(DNAlst)) #results are the same
[1] TRUE
longDNA <- sample(DNAlst,100,T)
microbenchmark(fstri(longDNA),fbio(longDNA))
Unit: microseconds
expr min lq mean median uq max neval
fstri(longDNA) 689.378 738.184 825.3014 766.862 793.134 6027.039 100
fbio(longDNA) 118371.825 125552.401 129543.6585 127245.489 129165.711 359335.294 100
127245.489/766.862
## [1] 165.9301
Ca 165x times faster :)
Upvotes: 7
Reputation: 845
We recently released our 'kebabs' package as part of the Bioconductor 3.0 release. Though this package is aimed at providing sequence kernels for classification, regression, and other tasks such as similarity-based clustering, the package includes functionality for computing k-mer frequencies efficiently, too:
#installing kebabs:
#source("http://bioconductor.org/biocLite.R")
#biocLite(c("kebabs", "Biostrings"))
library(kebabs)
s1 <- DNAString("ATCGATCGATCGATCGATCGATCGACTGACTAGCTAGCTACGATCGACTG")
s1
s2 <- DNAString(paste0(rep(s1, 200), collate=""))
s2
sk13 <- spectrumKernel(k=13, normalized=FALSE)
system.time(kmerFreq <- drop(getExRep(s1, sk13)))
kmerFreq
system.time(kmerFreq <- drop(getExRep(s2, sk13)))
kmerFreq
So you see that the k-mer frequencies are obtained as the explicit feature vector of the standard (unnormalized) spectrum kernel with k=13. This function is implemented in highly efficient C++ code that builds up a prefix tree and only considers k-mers that actually occur in the sequence (as you requested). You see that even for k=13 and a sequence with tens of thousands of bases, the computations only take fractions of a second (19 msecs on our 5-year-old Dell server). The above function also works for DNAStringSets, but, in this case, you should remove the drop() to get a matrix of k-mer frequencies. The matrix is by default sparse (class 'dgRMatrix'), but you can also enforce the result to be in standard dense matrix format (however, still omitting k-mers that do not occur at all in any of the sequences):
sv <- c(DNAStringSet(s1), DNAStringSet(s2))
system.time(kmerFreq <- getExRep(sv, sk13))
kmerFreq
system.time(kmerFreq <- getExRep(sv, sk13, sparse=FALSE))
kmerFreq
How long the k-mers may be, may depend on your system. On our system, the limit seems to be k=22 for DNA sequences. The same works for RNA and amino acid sequences. For the latter, however, the limits in terms of k are significantly lower, since the feature space is obviously much larger for the same k.
#for the kebabs documentation please see:
browseVignettes("kebabs")
I hope that helps. If you have any further questions, please let me know.
Best regards, Ulrich
Upvotes: 6
Reputation: 662
My answer wasn't as fast as @bartektartanus. However, it is also pretty fast and I wrote the code... :D
stri_count_fixed
stringi
package will get really slow for big k-mers since it
has to generate all possible combinations for pattern and afterwards,
check their existence in the data and count how many times it appears.k
instead of creating a pattern string. oligonucleotideFrequency
with a k
bigger than 12
in a big sequence, the function freezes for excess of memory use and
R is restarted, while with my function it runs pretty fast.sequence_kmers <- function(sequence, k){
k_mers <- lapply(sequence,function(x){
seq_loop_size <- length(DNAString(x))-k+1
kmers <- sapply(1:seq_loop_size, function(z){
y <- z + k -1
kmer <- substr(x=x, start=z, stop=y)
return(kmer)
})
return(kmers)
})
uniq <- unique(unlist(k_mers))
ind <- t(sapply(k_mers, function(x){
tabulate(match(x, uniq), length(uniq))
}))
colnames(ind) <- uniq
return(ind)
}
I use the Biostrings
package only to count the bases... you can use other options like stringi
to count...
if you remove all code below k_mers lapply
and return(k_mers)
it returns just the list... of all k-mers with the respective repeated vectors
sequence
here is a sequence of 1000bp#same output for 1 or multiple sequences
> sequence_kmers(sequence,4)[,1:10]
GTCT TCTG CTGA TGAA GAAC AACG ACGC CGCG GCGA CGAG
4 4 3 4 4 8 6 4 5 5
> sequence_kmers(c(sequence,sequence),4)[,1:10]
GTCT TCTG CTGA TGAA GAAC AACG ACGC CGCG GCGA CGAG
[1,] 4 4 3 4 4 8 6 4 5 5
[2,] 4 4 3 4 4 8 6 4 5 5
Tests done with my function:
#super fast for 1 sequence
> system.time({sequence_kmers(sequence,13)})
usuário sistema decorrido
0.08 0.00 0.08
#works fast for 1 sequence or 50 sequences of 1000bps
> system.time({sequence_kmers(rep(sequence,50),4)})
user system elapsed
3.61 0.00 3.61
#same speed for 3-mers or 13-mers
> system.time({sequence_kmers(rep(sequence,50),13)})
user system elapsed
3.63 0.00 3.62
Tests done with Biostrings
:
#Slow 1 sequence 12-mers
> system.time({oligonucleotideFrequency(DNAString(sequence),12)})
user system elapsed
150.11 1.14 151.37
#Biostrings package freezes for a single sequence of 13-mers
> system.time({oligonucleotideFrequency(sequence,13)})
freezes, used all my 8gb RAM
Upvotes: 5
Reputation: 887203
May be this helps
source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
library(Biostrings)
t(sapply(DNAlst, function(x){x1 <- DNAString(x)
oligonucleotideFrequency(x1,2)}))
# AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
#[1,] 2 1 0 1 1 0 0 1 1 0 0 0 0 0 1 3
#[2,] 5 1 1 2 0 1 1 0 2 0 0 1 2 0 1 0
#[3,] 0 0 0 2 0 0 0 0 0 1 0 0 1 0 1 1
#[4,] 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0
#[5,] 1 0 0 1 2 0 2 0 0 2 0 0 0 1 0 0
Or as suggested by @Arun, convert the list
to vector
first
oligonucleotideFrequency(DNAStringSet(unlist(DNAlst)), 2L)
# AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
#[1,] 2 1 0 1 1 0 0 1 1 0 0 0 0 0 1 3
#[2,] 5 1 1 2 0 1 1 0 2 0 0 1 2 0 1 0
#[3,] 0 0 0 2 0 0 0 0 0 1 0 0 1 0 1 1
#[4,] 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0
#[5,] 1 0 0 1 2 0 2 0 0 2 0 0 0 1 0 0
Upvotes: 7