Reputation: 45
I have a list of vectors stored in a data frame, built like this:
library(seqinr)
mydata <- read.fasta(file="mydata.fasta")
mydatavec <- mydata[[1]]
lst <- split(mydatavec, as.integer(gl(length(mydatavec), 100,length(mydatavec))))
df <- data.frame(matrix(unlist(lst), nrow=2057, byrow=T), stringsAsFactors=FALSE)
Now, each vector (row) in df is 100 letters long and made up of the letters "a", "c", "g", "t". I would like to calculate the Shannon entropy of each of these vectors. Here is an example of what I mean:
v1 <- count(df[1,], 1)
a c g t
27 26 24 23
v2 <- v1/sum(v1)
a c g t
0.27 0.26 0.24 0.23
v3 <- -sum(log(v2)*v2) ; print(v3)
[1] 1.384293
In total I need 2057 printed values, because that is how many vectors I have. My question: is it possible to create a for loop or repeat loop that would do this operation for me? I tried it myself but didn't get anywhere.
dput(head(sequence))
structure(c("function (nvec) ", "unlist(lapply(nvec, seq_len))"
), .Dim = c(2L, 1L), .Dimnames = list(c("1", "2"), ""), class = "noquote")
My attempt: I wanted to focus on the count function only and wrote this:
A <- matrix(0, 2, 4)
for (i in 1:2) {
A[i] <- count(df[i,], 1)
}
The loop correctly counts the number of "a" in the first vector and then moves on to the second one, but it completely ignores the rest of the letters:
A
     [,1] [,2] [,3] [,4]
[1,]   27    0    0    0
[2,]   28    0    0    0
Additionally, I naively thought that adding a bunch of "i" indices everywhere would make it work:
s <- matrix(0, 1, 4)
s1 <- matrix(0, 1, 4)
s2 <- numeric(4)
for (i in 1:2) {
s[i] <- count(df[i,],1)
s1[i] <- s[i]/sum(s[i])
s2[i] <- -sum(log(s1[i])*s1[i])
}
But that didn't get me anywhere either.
Upvotes: 1
Views: 94
Reputation: 3914
Would this work for you:
df <- data.frame (x = c("a","c","g","g","g"),
y = c("g","c","a","a","g"),
z = c("g","t","t","a","g"),stringsAsFactors=FALSE)
A <- sapply(1:nrow(df), FUN=function(i){count(df[i,],1)})
> A
  [,1] [,2] [,3] [,4] [,5]
a    1    0    1    2    0
c    0    2    0    0    0
g    2    0    1    1    3
t    0    1    1    0    0
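From a count matrix like A, the entropy asked for in the question could then be computed column by column. This is only a sketch: it uses base R table() as an assumed stand-in for seqinr's count(), so that it runs without extra packages, and the helper name shannon_entropy is made up for illustration. Letters with a zero count are dropped before taking the logarithm, because 0 * log(0) evaluates to NaN in R.

```r
# Same toy data as above; each row is one sequence.
df <- data.frame(x = c("a", "c", "g", "g", "g"),
                 y = c("g", "c", "a", "a", "g"),
                 z = c("g", "t", "t", "a", "g"), stringsAsFactors = FALSE)

# Count letters per row with base R (assumed stand-in for count(df[i, ], 1)).
alphabet <- c("a", "c", "g", "t")
A <- sapply(seq_len(nrow(df)), function(i) {
  table(factor(unlist(df[i, ]), levels = alphabet))
})

# Hypothetical helper: Shannon entropy (natural log) of one vector of counts.
shannon_entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]            # skip zero counts to avoid 0 * log(0) = NaN
  -sum(p * log(p))
}

# One entropy value per column of A, i.e. per original row of df.
entropy <- apply(A, 2, shannon_entropy)
print(entropy)
```

With the real data, the same apply() call would return one value per row of df, which avoids writing an explicit loop.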
Upvotes: 1
Reputation: 1123
If you don't need to save the counts and only need to print or save the result of the calculation you show, this should work:
entropy <- numeric(nrow(df)) # create the result vector first
for (i in seq_len(nrow(df))) {
  v1 <- count(df[i,], 1)
  v2 <- v1/sum(v1)
  v3 <- -sum(log(v2)*v2)
  print(v3) # to print the value
  entropy[i] <- v3 # to save the value in the vector
}
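The problem with the loop you show may be that the output of count is an object of class table with four entries, and you assign it to a single matrix element rather than to a matrix row. In the assignment you write s[i] <- count(df[i,],1) when it should be s[i,] <- count(df[i,],1). With s[i], R keeps only the first value (the count of "a") and warns that the number of items to replace is not a multiple of the replacement length, which is why the other columns stay 0.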
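The difference between s[i] and s[i, ] can be seen on a small example. This is a sketch under stated assumptions: the two-row data frame is invented for illustration, and base R table() is used as a stand-in for count() so that the code runs without seqinr.

```r
# Toy data: two sequences of four letters each (made up for illustration).
df <- data.frame(matrix(c("a", "a", "c", "g",
                          "t", "t", "t", "a"), nrow = 2, byrow = TRUE),
                 stringsAsFactors = FALSE)
alphabet <- c("a", "c", "g", "t")

# One row of counts per sequence, plus a vector for the entropies.
counts  <- matrix(0, nrow(df), length(alphabet),
                  dimnames = list(NULL, alphabet))
entropy <- numeric(nrow(df))

for (i in seq_len(nrow(df))) {
  # Fill the whole i-th row; counts[i] would address a single cell only.
  counts[i, ] <- table(factor(unlist(df[i, ]), levels = alphabet))
  p <- counts[i, ] / sum(counts[i, ])
  p <- p[p > 0]                      # avoid 0 * log(0) = NaN
  entropy[i] <- -sum(p * log(p))
}

print(counts)
print(entropy)
```

Indexing with counts[i, ] stores all four letter counts in row i, so every column is filled and the entropy can be computed from the stored row.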
Upvotes: 2