Patrick

Reputation: 45

Creating a for loop for a dataframe

I have a list of vectors stored

library(seqinr)
mydata <- read.fasta(file="mydata.fasta")
mydatavec <- mydata[[1]] 

lst <- split(mydatavec, as.integer(gl(length(mydatavec), 100,length(mydatavec))))

df <- data.frame(matrix(unlist(lst), nrow=2057, byrow=T), stringsAsFactors=FALSE)

Now each row of df is a vector of 100 letters drawn from "a", "c", "g", "t". I would like to calculate the Shannon entropy of each of these vectors. Here is an example of what I mean for the first one:

v1 <- count(df[1,], 1) 
a  c  g  t 
27 26 24 23     

v2 <- v1/sum(v1) 
  a    c    g    t 
0.27 0.26 0.24 0.23 

v3 <- -sum(log(v2)*v2) ; print(v3) 
[1]1.384293

In total I need 2057 printed values, because that is how many vectors I have. Is it possible to write a for loop or repeat loop that does this for every row? I tried it myself but didn't get anywhere.

dput(head(sequence))
structure(c("function (nvec) ", "unlist(lapply(nvec, seq_len))"
), .Dim = c(2L, 1L), .Dimnames = list(c("1", "2"), ""), class = "noquote")

My attempt: I wanted to focus on the count function only and created this

A <- matrix(0, 2, 4)

for (i in 1:2) {
  A[i] <- count(df[i,], 1)
}

The loop correctly counts the number of "a" in the first vector and then moves on to the second one, but it completely ignores the rest of the letters:

A
     [,1] [,2] [,3] [,4]
[1,]   27    0    0    0
[2,]   28    0    0    0

I also naively thought that adding a bunch of "i" indices everywhere would make it work:

s <- matrix(0, 1, 4)
s1 <- matrix(0, 1, 4)
s2 <- numeric(4)

for (i in 1:2) {
  s[i] <- count(df[i,],1)
  s1[i] <- s[i]/sum(s[i])
  s2[i] <- -sum(log(s1[i])*s1[i])
}

But that didn't get me anywhere either.

Upvotes: 1

Views: 94

Answers (2)

Katia

Reputation: 3914

Would this work for you? It uses count from seqinr on a small example data frame:

df <- data.frame (x = c("a","c","g","g","g"), 
                  y = c("g","c","a","a","g"), 
                  z = c("g","t","t","a","g"),stringsAsFactors=FALSE)


A <- sapply(1:nrow(df), FUN=function(i){count(df[i,],1)})

> A
  [,1] [,2] [,3] [,4] [,5]
a    1    0    1    2    0
c    0    2    0    0    0
g    2    0    1    1    3
t    0    1    1    0    0
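
To go from these counts to the entropy the question asks for, one possible continuation is to normalise each column of A and sum -p*log(p). The sketch below is only an illustration under a few assumptions: it uses base R's table() on a factor with fixed levels instead of seqinr's count(), so it runs without the package, and it treats 0*log(0) as 0. The helper names count_bases and shannon are made up for this example.

```r
# Toy data frame from above: each row is one sequence.
df <- data.frame(x = c("a","c","g","g","g"),
                 y = c("g","c","a","a","g"),
                 z = c("g","t","t","a","g"), stringsAsFactors = FALSE)

# Hypothetical helper: base-R stand-in for seqinr::count(x, 1).
count_bases <- function(x) {
  table(factor(x, levels = c("a", "c", "g", "t")))
}

# Hypothetical helper: Shannon entropy (natural log) from a vector of counts.
shannon <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]          # drop zero frequencies so 0 * log(0) does not give NaN
  -sum(p * log(p))
}

# One column of counts per row of df, as in A above.
A <- sapply(seq_len(nrow(df)), function(i) count_bases(unlist(df[i, ])))

# One entropy value per row of df.
entropy <- apply(A, 2, shannon)
print(entropy)
```

With the real data, the same apply() call over the 2057 columns of counts should give one entropy value per vector, without an explicit loop.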

Upvotes: 1

Santiago I. Hurtado

Reputation: 1123

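If you don't need to save the counts and only need to print or save the result of the calculation you show, this should work:

entropy <- numeric(nrow(df))  # create the result vector before the loop

for (i in seq_len(nrow(df))) {
    v1 <- count(df[i,], 1)     # letter counts for row i
    v2 <- v1/sum(v1)           # relative frequencies
    v3 <- -sum(log(v2)*v2)     # Shannon entropy (natural log)
    print(v3)                  # print the value
    entropy[i] <- v3           # save the value in the vector
}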

The problem with the loops you show is probably that count returns a table with four values (one per letter), while A[i] and s[i] each refer to a single element of the matrix. R then keeps only the first value, the count of "a", and warns that the number of items to replace is not a multiple of the replacement length. To store all four counts, assign to a whole row: write s[i,] <- count(df[i,],1) instead of s[i] <- count(df[i,],1).
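
The difference between s[i] and s[i,] can be seen on a small made-up example. The sketch below assumes a toy two-row data frame and, so that it runs without seqinr, uses a base R stand-in for count() (count_bases, a hypothetical helper). The numbers are only illustrative.

```r
# Toy rows standing in for the 100-letter vectors (made-up data).
df <- data.frame(matrix(c("a","c","g","t","a","a",
                          "c","c","g","g","t","t"),
                        nrow = 2, byrow = TRUE), stringsAsFactors = FALSE)

# Hypothetical base-R stand-in for seqinr::count(x, 1).
count_bases <- function(x) table(factor(unlist(x), levels = c("a","c","g","t")))

s  <- matrix(0, nrow(df), 4, dimnames = list(NULL, c("a","c","g","t")))
s1 <- s
s2 <- numeric(nrow(df))

for (i in seq_len(nrow(df))) {
  s[i, ]  <- count_bases(df[i, ])        # fill the whole row, not one cell
  s1[i, ] <- s[i, ] / sum(s[i, ])        # relative frequencies of row i
  p <- s1[i, ][s1[i, ] > 0]              # skip zeros to avoid 0 * log(0) = NaN
  s2[i] <- -sum(p * log(p))              # Shannon entropy of row i
}

print(s)
print(s2)
```

Here s holds one row of four counts per sequence and s2 holds one entropy value per sequence, which is the shape the original attempt was aiming for.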

Upvotes: 2
