fugu
fugu

Reputation: 6578

lapply alternative to for loop to append to data frame

I have a data frame:

df<-structure(list(chrom = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
                                       2L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"), 
                     pos = c(10L, 200L, 134L, 400L, 600L, 1000L, 20L, 33L, 40L, 
                             45L, 50L, 55L, 100L, 123L)), .Names = c("chrom", "pos"), row.names = c(NA, -14L), class = "data.frame")

> head(df)
  chrom  pos
1     1   10
2     1  200
3     1  134
4     1  400
5     1  600
6     1 1000

And I want to calculate pos[i+1] - pos[i] on the sample chromosome (chrom)

By using a for loop over each chrom level, and another over each row I get the expected results:

 for (c in levels(df$chrom)){
    df_chrom<-filter(df, chrom == c)
    df_chrom<-arrange(df_chrom, df_chrom$pos)

    for (i in 1:nrow(df_chrom)){
      dist<-(df_chrom$pos[i+1] - df_chrom$pos[i])
      logdist<-log10(dist)
      cat(c, i, df_chrom$pos[i], dist, logdist, "\n")
    }
  }

However, I want to save this to a data frame, and think that lapply or apply is the right way to go about this. I can't work out how to make the pos[i+1] - pos[i] calculation though (seeing as lapply works on each row/column.

Any pointers would be appreciated

Here's the output from my solution:

chrom index pos dist log10dist
1 1 10 124 2.093422 
1 2 134 66 1.819544 
1 3 200 200 2.30103 
1 4 400 200 2.30103 
1 5 600 400 2.60206 
1 6 1000 NA NA 
2 1 20 13 1.113943 
2 2 33 NA NA 
3 1 40 5 0.69897 
3 2 45 NA NA 
4 1 50 5 0.69897 
4 2 55 45 1.653213 
4 3 100 23 1.361728 
4 4 123 NA NA 

Upvotes: 0

Views: 911

Answers (2)

d.b
d.b

Reputation: 32548

We split df by df$chrom (Note that we reorder both df and df$chrom before splitting). Then we go through each of the subgroups (the subgroups are called a in this example) using lapply. On the pos column of each subgroup, we calculate difference (diff) of consecutive elements and take log10. Since diff decreases the number of elements by 1, we add a NA to the end. Finally, we rbind all the subgroups together using do.call.

do.call(rbind, lapply(split(df[order(df$chrom, df$pos),], df$chrom[order(df$chrom, df$pos)]),
                      function(a) data.frame(a, dist = c(log10(diff(a$pos)), NA))))
#     chrom  pos     dist
#1.1      1   10 2.093422
#1.3      1  134 1.819544
#1.2      1  200 2.301030
#1.4      1  400 2.301030
#1.5      1  600 2.602060
#1.6      1 1000       NA
#2.7      2   20 1.113943
#2.8      2   33       NA
#3.9      3   40 0.698970
#3.10     3   45       NA
#4.11     4   50 0.698970
#4.12     4   55 1.653213
#4.13     4  100 1.361728
#4.14     4  123       NA

Upvotes: 1

akrun
akrun

Reputation: 887531

We could do this using a group by difference. Convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'chrom', order the 'pos', get the difference of 'pos' (diff) and also log of the difference

library(data.table)
setDT(df)[order(pos), {v1 <- diff(pos)
                   .(index = seq_len(.N), pos = pos, 
                      dist = c(v1, NA), logdiff = c(log10(v1), NA))}

               , by = chrom]
#    chrom index  pos dist  logdiff
# 1:     1     1   10  124 2.093422
# 2:     1     2  134   66 1.819544
# 3:     1     3  200  200 2.301030
# 4:     1     4  400  200 2.301030
# 5:     1     5  600  400 2.602060
# 6:     1     6 1000   NA       NA
# 7:     2     1   20   13 1.113943
# 8:     2     2   33   NA       NA
# 9:     3     1   40    5 0.698970
#10:     3     2   45   NA       NA
#11:     4     1   50    5 0.698970
#12:     4     2   55   45 1.653213
#13:     4     3  100   23 1.361728
#14:     4     4  123   NA       NA

Upon running the OP's code the output printed are

#1 1 10 124 2.093422 
#1 2 134 66 1.819544 
#1 3 200 200 2.30103 
#1 4 400 200 2.30103 
#1 5 600 400 2.60206 
#1 6 1000 NA NA 
#2 1 20 13 1.113943 
#2 2 33 NA NA 
#3 1 40 5 0.69897 
#3 2 45 NA NA 
#4 1 50 5 0.69897 
#4 2 55 45 1.653213 
#4 3 100 23 1.361728 
#4 4 123 NA NA 

Upvotes: 1

Related Questions