Steven

Reputation: 239

Consolidating rows into regions in data frame in R?

I have some data that's neatly arranged in an R data frame that looks like this:

chr    from    to    value
  1     101   200     0.17
  1     201   300     0.18
  1     301   400     0.17
  1     801   900     0.34
  1     901  1000     0.37
  1    1001  1100     0.36
  2     501   600     0.22
  2     601   700     0.23
  2    1501  1600     0.11
  2    1601  1700     0.13

What I would like to do is consolidate the lowest number in the from column with the greatest number in the to column but only for contiguous regions within each chromosome, and then take the average of the value field. So for example, the output should look like this:

chr    from    to    value
  1     101   400    0.173
  1     801  1100    0.357
  2     501   700    0.225
  2    1501  1700    0.12

I was wondering if anyone had some thoughts as to how this could be done in R or if there's any package out there that might be able to help me with this kind of thing.

Upvotes: 1

Views: 77

Answers (1)

akrun

Reputation: 887991

We can use data.table. First convert the 'data.frame' to a 'data.table' with setDT(df1). Grouped by 'chr', we create a new column 'ind': take the difference between each 'from' value and the preceding 'to' value, check which differences are not equal to 1, pad with TRUE so the lengths match, and take the cumulative sum. A contiguous run of rows therefore shares one 'ind' value, and 'ind' increments whenever a gap appears. Then, using 'ind' and 'chr' as grouping variables, we get the first value of 'from' (from[1L]), the last value of 'to' (to[.N]), and the mean of the 'value' column. If needed, the 'ind' column can be removed by assigning NULL to it.

library(data.table) # v1.9.6+
setDT(df1)[, ind := cumsum(c(TRUE, (from[-1L] - to[-.N]) != 1)), by = chr]
df1[, list(from = from[1L], to = to[.N], value = mean(value)),
    by = .(ind, chr)][, ind := NULL][]
#   chr from   to     value
#1:   1  101  400 0.1733333
#2:   1  801 1100 0.3566667
#3:   2  501  700 0.2250000
#4:   2 1501 1700 0.1200000
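To make the grouping step concrete, here is a small standalone illustration (not part of the original answer) of how the 'ind' index is built, using the chromosome-1 rows of the sample data in plain base R:

```r
# Rows for chr == 1 from the sample data
from <- c(101L, 201L, 301L, 801L, 901L, 1001L)
to   <- c(200L, 300L, 400L, 900L, 1000L, 1100L)

# Difference between each 'from' and the previous 'to';
# a value other than 1 means the rows are not contiguous.
# Pad with TRUE so the first row always starts a region,
# then cumsum turns the break flags into a group index.
ind <- cumsum(c(TRUE, (from[-1L] - to[-length(to)]) != 1))
ind
# [1] 1 1 1 2 2 2
```

Rows 1-3 (101-400) share index 1 and rows 4-6 (801-1100) share index 2, which is exactly what the `by = .(ind, chr)` grouping above aggregates over.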

Or this can be done in a slightly more compact way by grouping directly on the cumsum(...) along with 'chr', then getting the first value of 'from', the last of 'to', and the mean of 'value'. Here shift gets the lag of the 'to' column (type = 'lag' is the default in shift); comparing it with 'from' gives a logical vector whose cumulative sum marks the contiguous regions.

setDT(df1)[, list(from = from[1L], to = to[.N], value = mean(value)),
   .(V1 = cumsum(from != shift(to, fill = TRUE) + 1L), chr)][, V1 := NULL][]

A similar option using dplyr would be

library(dplyr)
df1 %>% 
    group_by(V1=cumsum(from!=lag(to, default=TRUE)+1L), chr) %>%
    summarise(from=first(from), to=last(to), value=mean(value)) %>% 
    ungroup() %>%
    select(-V1)
#    chr  from    to     value
#   (int) (int) (int)     (dbl)
#1     1   101   400 0.1733333
#2     1   801  1100 0.3566667
#3     2   501   700 0.2250000
#4     2  1501  1700 0.1200000

data

df1 <- structure(list(chr = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), from = c(101L, 201L, 301L, 801L, 901L, 1001L, 501L, 601L, 
1501L, 1601L), to = c(200L, 300L, 400L, 900L, 1000L, 1100L, 600L, 
700L, 1600L, 1700L), value = c(0.17, 0.18, 0.17, 0.34, 0.37, 
0.36, 0.22, 0.23, 0.11, 0.13)), .Names = c("chr", "from", "to", 
"value"), class = "data.frame", row.names = c(NA, -10L))

Upvotes: 1
