Reputation: 970
I'm starting with a data.frame of genome ranges (a chromosome, and a start - end position). I'm trying to combine rows that are 1) adjacently positioned and 2) that share a value in two other columns. Note: I'd like an efficient method since my real data is > 10 million rows. (data.table if possible)
Toy data:
DF <- data.frame(SampleID = c(1,1,1,1,1,2,2),
Chr = c(1,1,1,1,2,1,1),
Start = c(1, 101, 201, 401, 500, 1, 101),
End = c(100, 200, 300, 499, 599, 100, 200),
State = c(3,3,2,3,3,2,2)
)
DF
SampleID Chr Start End State
1: 1 1 1 100 3
2: 1 1 101 200 3
3: 1 1 201 300 2
4: 1 1 401 499 3
5: 1 2 500 501 3
6: 2 1 1 100 2
7: 2 1 101 200 2
Lines 1 & 2 can be combined because they are adjacent (1-100 & 101-200) and share a SampleID
(1) and State
(3).
The following cannot be combined:
State
sState
Chr
)SampleID
. Etcetera. When we apply all these, we have this final table.
FinalDF <- data.frame(SampleID = c(1,1,1,1,2),
Chr = c(1,1,1,2,1),
Start = c(1,201,401,500,1),
End = c(200,300,499,599,200),
State = c(3,2,3,3,2))
FinalDF
SampleID Chr Start End State
1 1 1 1 200 3
2 1 1 201 300 2
3 1 1 401 499 3
4 1 2 500 599 3
5 2 1 1 200 2
So, far, I've tried using the reduce function from the GenomicRanges package, but it doesn't work.
INCORRECT OUTPUT
reduce(DF2)
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 [ 1, 300] *
[2] 1 [401, 499] *
[3] 2 [500, 501] *
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
I was trying to do something with data.table, since my data.frames are 10 million rows long or more, but haven't been able to figure it out.
The following question is along the same lines (maybe a bit more complex), but has no solution. R- collapse rows based on contents of two columns
Upvotes: 1
Views: 2702
Reputation: 49448
library(data.table)
dt = as.data.table(DF) # or convert in place using setDT
dt[, .(Start = min(Start), End = max(End), State = State[1])
, by = .(SampleID, Chr, rleid(State),
cumsum(c(FALSE, head(End + 1, -1) < tail(Start, -1))))]
# SampleID Chr rleid cumsum Start End State
#1: 1 1 1 0 1 200 3
#2: 1 1 2 0 201 300 2
#3: 1 1 3 1 401 499 3
#4: 1 2 3 1 500 599 3
#5: 2 1 4 1 1 200 2
Upvotes: 4
Reputation: 21
# This code is kind of robust but it appears to get the job done
DF <- data.frame(SampleID = c(1,1,1,1,1,2,2),
Chr = c(1,1,1,1,2,1,1),
Start = c(1, 101, 201, 401, 500, 1, 101),
End = c(100, 200, 300, 499, 599, 100, 200),
State = c(3,3,2,3,3,2,2)
)
test_and_combine <- function(r1,r2) {
if (r1[,1] == r2[,1] & # check if "SampleID" column matches
r1[,2] == r2[,2] & # check if "Chr" column matches
(r1[,4] + 1) == r2[,3] & # test if Start and End are in sequence
r1[,5] == r2[,5]) # check if "State"column matches
{
# merge rows if true
DF_comb <- r1[,]
DF_comb[1,4] <- r2[,4]
}
else{
DF_comb <- NA
}
return(DF_comb)
}
# This section could rewritten to use Reduce()
DF_comb_final <- data.frame()
for(i in 1:(nrow(DF)-1)){ # loop through ever row of data.frame
DF_temp <- test_and_combine(DF[i,],DF[i+1,]) # send two rows to function
if(!any(is.na(DF_temp))){
DF_comb_final <- rbind(DF_comb_final,DF_temp)
}
}
Upvotes: 1
Reputation: 2375
If I interpret correctly what you want to do, I suggest the following: use dplyr
to group by the metadata you want to keep separate and then use GenomicRanges
to figure out the ranges within each group (if you run into performance issues you may want to steer clear of data.frame
required for GenomicRanges
and implement it by hand to take advantage of the performance of dyplr
with data.tables). Here's an example of how this would work (making use of the pipe %>%
to make it easier to see what's going on):
DF <- data.frame(SampleID = c(1,1,1,1,1,2,2),
Chr = c(1,1,1,1,2,1,1),
Start = c(1, 101, 201, 401, 500, 1, 101),
End = c(100, 200, 300, 499, 599, 100, 200),
State = c(3,3,2,3,3,2,2)
)
library(dplyr)
# take your data frame
DF %>%
# group it by the subsets
group_by(SampleID, Chr, State) %>%
# operate on each group
do(
# turn subset into a GRanges object
as(as.data.frame(.), "GRanges") %>%
# reducae ranges
GenomicRanges::reduce() %>%
# turn back into data frame for dplyr to stitch together
as.data.frame() %>%
# get the information you want
select(start, end, width)
) %>%
# ungroup for future operations
ungroup() %>%
# sort by what makes most sense for your set
arrange(SampleID, Chr, start)
Output:
Source: local data frame [5 x 6]
SampleID Chr State start end width
(dbl) (dbl) (dbl) (int) (int) (int)
1 1 3 1 200 200
1 1 2 201 300 100
1 1 3 401 499 99
1 2 3 500 599 100
2 1 2 1 200 200
Upvotes: 2