Reputation: 608
I have a set of biological sequences and need to build a count matrix of every letter-to-letter transition, i.e. A followed by A, A followed by T, ..., T followed by T.
I couldn't find a package that builds the 4x4 matrix automatically from my data, so I have been doing it manually by counting each 2-letter combination in each sequence. Now I need to add the 2-letter counts up across sequences, so that each combination (AA, AT, ..., TT) has one total over all sequences, and that is where I am lost.
Code to get my 2-letter counts:
library(stringr)  # provides str_count()
AA <- str_count(data$Sequence, "AA"); AC <- str_count(data$Sequence, "AC")
AG <- str_count(data$Sequence, "AG"); AT <- str_count(data$Sequence, "AT")
CA <- str_count(data$Sequence, "CA"); CC <- str_count(data$Sequence, "CC")
CG <- str_count(data$Sequence, "CG"); CT <- str_count(data$Sequence, "CT")
GA <- str_count(data$Sequence, "GA"); GC <- str_count(data$Sequence, "GC")
GG <- str_count(data$Sequence, "GG"); GT <- str_count(data$Sequence, "GT")
TA <- str_count(data$Sequence, "TA"); TC <- str_count(data$Sequence, "TC")
TG <- str_count(data$Sequence, "TG"); TT <- str_count(data$Sequence, "TT")
I am open to outside packages or functions that solve this problem, as well as any that do what the code above does more efficiently.
Upvotes: 2
Views: 152
Reputation: 46908
You can use Biostrings:
library(Biostrings)
data = data.frame(Sequence=c("AGGATC","GTCCCA"))
dinucleotideFrequency(DNAStringSet(as.character(data$Sequence)))
     AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
[1,]  0  0  1  1  0  0  0  0  1  0  1  0  0  1  0  0
[2,]  0  0  0  0  1  2  0  0  0  0  0  1  0  1  0  0
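To get from these per-sequence counts to the single 4x4 matrix the question asks for, one option is to sum the columns and reshape the 16 totals. The sketch below is base R only; the `freq` matrix is typed in from the output shown above so it runs without Biostrings, and the names `freq`, `bases` and `transitions` are illustrative, not part of any package.

```r
# Per-sequence dinucleotide counts, copied from the output above.
pairs <- c("AA", "AC", "AG", "AT", "CA", "CC", "CG", "CT",
           "GA", "GC", "GG", "GT", "TA", "TC", "TG", "TT")
freq <- rbind(
  c(0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0),
  c(0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0)
)
colnames(freq) <- pairs

# Sum each dinucleotide over all sequences.
totals <- colSums(freq)

# The columns are ordered AA, AC, AG, AT, CA, ..., so filling by row puts
# the first letter on the rows and the second letter on the columns.
bases <- c("A", "C", "G", "T")
transitions <- matrix(totals, nrow = 4, byrow = TRUE,
                      dimnames = list(from = bases, to = bases))
transitions
```

With real data, `freq` would be the matrix returned by `dinucleotideFrequency()` rather than a hand-typed copy.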
Upvotes: 3
Reputation: 5254
This gives you a count of every 2-letter combination for each element of data$Sequence.
require(stringr)
data <- data.frame(Sequence = c("AAGGATA", "TAAGCAA"))
Couples <- paste0(rep(c("A", "C", "G", "T"),4), rep(c("A", "C", "G", "T"), each=4))
sapply(Couples, function(x) str_count(data$Sequence, x))
For the total counts across all sequences, take the column sums:
colSums( sapply(Couples, function(x) str_count(data$Sequence, x)) )
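One caveat with pattern counting: `str_count()` counts non-overlapping matches, so a run such as "AAA" yields one "AA" rather than the two A-to-A transitions it contains. If overlapping transitions matter, a possible alternative is to pair each letter with its successor directly. The following is a minimal base-R sketch under that assumption; `count_transitions` is a hypothetical helper name, not a function from stringr or Biostrings.

```r
# Count letter-to-letter transitions over a character vector of sequences.
# Letters outside `alphabet` (e.g. "N") become NA and are dropped by table().
count_transitions <- function(seqs, alphabet = c("A", "C", "G", "T")) {
  n <- length(alphabet)
  m <- matrix(0L, n, n, dimnames = list(from = alphabet, to = alphabet))
  for (s in seqs) {
    ch <- strsplit(s, "", fixed = TRUE)[[1]]
    if (length(ch) < 2) next                      # no transitions to count
    from <- factor(ch[-length(ch)], levels = alphabet)  # all but last letter
    to   <- factor(ch[-1], levels = alphabet)           # all but first letter
    m <- m + unclass(table(from, to))
  }
  m
}

data <- data.frame(Sequence = c("AAGGATA", "TAAGCAA"))
count_transitions(data$Sequence)
```

Rows are the first letter and columns the second, so the result is already the 4x4 matrix summed over all sequences.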
Upvotes: 1