Reputation: 911
So I have a sequence of nucleotides and I need to count the number of times the word gaga appears in the sequence. This is what I have so far:
dna=c("a","g","c","t")
N=16
x=sample(dna,N,replace=TRUE)  # third argument is `replace`; the original 4 was coerced to TRUE
x2=paste(x,collapse="")
x2
Here is an example output:
gtaggcctaattataa
Eventually, I am going to write a loop to run this 100 times and plot a histogram of the counts of the word "gaga". So my main question is: how can I write a function or some code that searches through the string x2 and counts the number of occurrences of the word "gaga"?
Any help would be appreciated! Thank you!
Upvotes: 1
Views: 2610
Reputation: 81693
Here's an approach that counts overlaps too:
vec <- c("gagatttt",
"ttttgaga",
"gaga",
"tttgagattt",
"gagagaga",
"gagaga")
lengths(strsplit(vec, "ga(?=ga)", perl = TRUE)) - 1L
# [1] 1 1 1 1 3 2
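The same overlapping count can also be sketched with gregexpr and a zero-width lookahead, which avoids the "minus one" bookkeeping and returns 0 for strings with no match. This is only an illustration: the helper name count_overlapping is made up here, and it assumes the pattern contains no regex metacharacters.

```r
# Hypothetical helper (the name is illustrative, not from any package):
# count overlapping occurrences of `pattern` in each element of `x`.
# Assumes `pattern` contains no regex metacharacters.
count_overlapping <- function(x, pattern) {
  # "(?=gaga)" matches the empty string at every position where "gaga" starts
  hits <- gregexpr(paste0("(?=", pattern, ")"), x, perl = TRUE)
  # gregexpr() returns -1 when nothing matches, so count only positive starts
  vapply(hits, function(h) sum(h > 0L), integer(1))
}

vec <- c("gagatttt", "ttttgaga", "gaga", "tttgagattt",
         "gagagaga", "gagaga", "tttt")
count_overlapping(vec, "gaga")
# [1] 1 1 1 1 3 2 0
```

Because the lookahead consumes no characters, "gagagaga" yields three matches (starting at positions 1, 3 and 5) rather than two.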
Upvotes: 1
Reputation: 16080
Use stri_count_fixed from the stringi package:
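library(stringi)
dna=c("a","g","c","t")
N=160
x=sample(dna,N,replace=TRUE)  # sample with replacement
x2 <- stri_paste(x,collapse="")
stri_count_fixed(x2,"gaga")
## 2 (the count varies from run to run because x is random)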
Upvotes: 1
Reputation: 109874
This is actually a wrapper for DWin's solution found in the qdap package:
x<- c("gtaggcctaattataa", "gtaggcctaatgagaataa", "gagagaga")
library(qdap)
qdap:::termco.h(x, "gaga", seq_along(x))
##   3 word.count term(gaga)
## 1 1          1          0
## 2 2          1          1
## 3 3          1          2
If you want just the counts:
qdap:::termco.h(x, "gaga", 1:3)[, 3]
Upvotes: 2
Reputation: 263362
?regex
sapply(gregexpr("gaga", c("gtaggcctaattataa",
                          "gtaggcctaatgagaataa",
                          "gagagaga")),
       function(x) if (x[1] == -1) 0 else length(x))
# [1] 0 1 2
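For the 100-run simulation and histogram mentioned in the question, one possible sketch built on the gregexpr idea is below. The helper name count_word, the seed, and the plot labels are all assumptions made for illustration, and this version counts non-overlapping matches (so "gagagaga" counts as 2).

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

dna <- c("a", "g", "c", "t")
N <- 16

# Hypothetical helper: count non-overlapping occurrences of `word` in one string
count_word <- function(s, word = "gaga") {
  m <- gregexpr(word, s, fixed = TRUE)[[1]]
  if (m[1] == -1) 0L else length(m)
}

# Generate 100 random sequences of length N and count "gaga" in each
counts <- replicate(100, {
  x2 <- paste(sample(dna, N, replace = TRUE), collapse = "")
  count_word(x2)
})

table(counts)

# One bar per integer count
hist(counts,
     breaks = seq(-0.5, max(counts) + 0.5, by = 1),
     main = "Occurrences of \"gaga\" in 100 random sequences",
     xlab = "count")
```

With N = 16 most runs will give 0, so a longer sequence may make the histogram more informative.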
Upvotes: 4