Reputation: 239
I have some data that's neatly arranged in an R data frame that looks like this:
chr from to value
1 101 200 0.17
1 201 300 0.18
1 301 400 0.17
1 801 900 0.34
1 901 1000 0.37
1 1001 1100 0.36
2 501 600 0.22
2 601 700 0.23
2 1501 1600 0.11
2 1601 1700 0.13
What I would like to do is consolidate the lowest number in the 'from' column with the greatest number in the 'to' column, but only for contiguous regions within each chromosome, and then take the average of the 'value' field. So, for example, the output should look like this:
chr from to value
1 101 400 0.173
1 801 1100 0.357
2 501 700 0.225
2 1501 1700 0.12
I was wondering if anyone had some thoughts as to how this could be done in R or if there's any package out there that might be able to help me with this kind of thing.
Upvotes: 1
Views: 77
Reputation: 887991
We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)). Grouped by 'chr', we create a new column ('ind'): take the difference between each value of 'from' and the previous row's value of 'to', check which differences are not equal to 1, pad with TRUE at the start so the lengths match, and take the cumsum. Using 'ind' and 'chr' as grouping variables, we get the first value of 'from' (from[1L]), the last value of 'to' (to[.N]), and the mean of the 'value' column. If needed, we can remove 'ind' by assigning it to NULL.
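library(data.table)  # v1.9.6+
# Per-chromosome region index: a new region starts wherever 'from' is not
# exactly one more than the previous row's 'to'
setDT(df1)[, ind := cumsum(c(TRUE, (from[-1L] - to[-.N]) != 1)), by = chr]
# Collapse each (ind, chr) region, then drop the helper column from the result
df1[, list(from = from[1L], to = to[.N], value = mean(value)),
    .(ind, chr)][, ind := NULL][]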
# chr from to value
#1: 1 101 400 0.1733333
#2: 1 801 1100 0.3566667
#3: 2 501 700 0.2250000
#4: 2 1501 1700 0.1200000
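To make the 'ind' step concrete, here is a minimal base R sketch that walks through it on the chromosome 1 rows only. The intermediate names gap and starts are made up for illustration; the answer's one-liner computes the same thing inside data.table.

```r
# 'from' and 'to' for chromosome 1, copied from the question's data
from <- c(101L, 201L, 301L, 801L, 901L, 1001L)
to   <- c(200L, 300L, 400L, 900L, 1000L, 1100L)

# Distance between each row's 'from' and the previous row's 'to'
# (a value of 1 means the two rows are contiguous)
gap <- from[-1L] - to[-length(to)]

# TRUE where a new region starts; the first row always starts one
starts <- c(TRUE, gap != 1L)

# Running count of region starts gives one index per contiguous region
ind <- cumsum(starts)
ind
# [1] 1 1 1 2 2 2
```

Rows that share an index value belong to the same contiguous region, which is why 'ind' can be used as a grouping variable together with 'chr'.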
Or this can be done in a slightly more compact way by grouping directly on the cumsum(...) along with 'chr', and then getting the first value of 'from', the last of 'to', and the mean of 'value'. We use shift to get the lag of the 'to' column (type='lag' is the default in shift), compare it with 'from' to get a logical vector, and take the cumulative sum (cumsum).
setDT(df1)[, list(from = from[1L], to = to[.N], value = mean(value)),
           .(V1 = cumsum(from != shift(to, fill = TRUE) + 1L), chr)][, V1 := NULL][]
A similar option using dplyr would be
library(dplyr)
df1 %>%
  group_by(V1 = cumsum(from != lag(to, default = TRUE) + 1L), chr) %>%
  summarise(from = first(from), to = last(to), value = mean(value)) %>%
  ungroup() %>%
  select(-V1)
# chr from to value
# (int) (int) (int) (dbl)
#1 1 101 400 0.1733333
#2 1 801 1100 0.3566667
#3 2 501 700 0.2250000
#4 2 1501 1700 0.1200000
df1 <- structure(list(chr = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), from = c(101L, 201L, 301L, 801L, 901L, 1001L, 501L, 601L,
1501L, 1601L), to = c(200L, 300L, 400L, 900L, 1000L, 1100L, 600L,
700L, 1600L, 1700L), value = c(0.17, 0.18, 0.17, 0.34, 0.37,
0.36, 0.22, 0.23, 0.11, 0.13)), .Names = c("chr", "from", "to",
"value"), class = "data.frame", row.names = c(NA, -10L))
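For readers without either package, the same idea can be sketched in base R. This is an illustrative alternative rather than part of the approaches above: it assumes the rows are already sorted by 'chr' and then by 'from', and the names n, starts, grp, and res are made up for this sketch. The data are rebuilt with data.frame() only so that the block runs on its own.

```r
# Same data as in the question, rebuilt so the sketch is self-contained
df1 <- data.frame(
  chr   = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  from  = c(101L, 201L, 301L, 801L, 901L, 1001L, 501L, 601L, 1501L, 1601L),
  to    = c(200L, 300L, 400L, 900L, 1000L, 1100L, 600L, 700L, 1600L, 1700L),
  value = c(0.17, 0.18, 0.17, 0.34, 0.37, 0.36, 0.22, 0.23, 0.11, 0.13)
)

n <- nrow(df1)

# A new region starts on the first row, whenever the chromosome changes,
# or whenever 'from' is not exactly one more than the previous row's 'to'
# (assumes rows are sorted by chr, then from)
starts <- c(TRUE,
            (df1$chr[-1L] != df1$chr[-n]) |
              ((df1$from[-1L] - df1$to[-n]) != 1L))

# One running index per contiguous region across the whole table
grp <- cumsum(starts)

# Collapse each region to a single row: first 'from', last 'to', mean 'value'
res <- do.call(rbind, lapply(split(df1, grp), function(d) {
  data.frame(chr = d$chr[1L], from = d$from[1L],
             to = d$to[nrow(d)], value = mean(d$value))
}))
rownames(res) <- NULL
res
```

Because the chromosome change is folded into starts, a single index is enough here and no separate grouping by 'chr' is needed.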
Upvotes: 1