Reputation: 941
I have a data.frame, `inputDF`, that consists of two index vectors: the query index, `que`, and the target index, `subj`. It is the result of searching for overlaps among the intervals of three individual data.frames simultaneously (think of aligning three interval sets in parallel). I want to rebuild this data.frame by position index: reduce the dimension of `inputDF`, regroup the indices, and build a new data.frame that shows, for each query index, the indices it overlaps. Is there an easy and efficient way to manipulate `inputDF` and reconstruct my desired data.frame?
Here is a visualization of the interval alignment:
Here is the resulting example data.frame:
inputDF <- data.frame(
que=c(5 , 7 , 8 , 9 ,14 ,16, 17 ,20 ,21, 22 , 8 , 9 ,16 ,22 , 2 ,12 ,15 ,18,
21 , 4 , 3 , 7 ,15 ,21 ,13 ,19 , 4 , 5 , 6, 13, 14, 19 ,20, 2 , 3 ,12,
18 , 6 , 5 ,11, 14, 20 ,8 ,16 ,22 , 9 ,17 , 1, 10 , 1 , 2 , 3, 11,12,
18 , 1 ,10),
subj=c( 5 , 7 , 8, 17 , 5 ,8 ,17 , 5 ,7 ,8, 22 ,22, 22, 22 , 2 ,2 ,15, 2,
15 ,4 ,3 ,21 ,21 ,21 ,13 ,13 ,20 ,20 ,20 ,19 ,20 ,19 ,20 ,12 ,12 ,12,
12 ,6 ,14 ,11 ,14 ,14 ,16 ,16 ,16 ,9 ,9 ,1 ,1 ,18 ,18 ,18 ,18 ,18, 18 ,10 ,10)
)
To build the desired data.frame, I used `NA` in `subj_2` wherever a query has no second overlapping interval. This is my desired data.frame:
desiredDF <- data.frame(
que=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22),
self.subj=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22),
subj_1=c(10,12,12,20,14,20,21,16,17,1,18,12,19,5,21,8,9,12,13,5,7,8),
subj_2=c(18,18,18,NA,20,NA,NA,22,22,NA,NA,18,NA,20,NA,22,NA,18,NA,14,15,16)
)
Edit: for example, here is the interval data, followed by how my desired data.frame is constructed from it:
intDF <- list(
bar=data.frame(start=c(8,18,33,53,69,81,105,115,135),
stop=c(14,21,39,61,73,87,111,120,153)),
cat=data.frame(start=c(6,15,20,44,71,99,113,141),
stop=c(10,17,34,51,78,103,124,147)),
foo=data.frame(start=c(11,43,57,101,117),
stop=c(36,49,92,109,139))
)
library(dplyr)  # provides bind_rows()
# Stack the three sets so that a position index such as `10` or `11` refers to the 10th or 11th row of `intDF`.
intDF <- bind_rows(intDF)
que self.sub subj1 subj2
1 1 10 18
2 2 12 18
3 3 12 18
4 4 20
5 5 14 20
6 6 20
7 7 21
8 8 16 22
How can I build my desired data.frame? Is there an efficient way to manipulate `inputDF` to get there?
Upvotes: 1
Views: 74
Reputation: 24945
We can do this using dplyr.
First we group by 'que' and sort by 'subj', then take the first and second 'subj' values that are not equal to 'que' as the two new columns:
library(dplyr)
inputDF %>%
group_by(que) %>%
arrange(subj) %>%
summarise(self.sub = que[1], subj1 = subj[subj!=que][1], subj2 = subj[subj!=que][2])
Source: local data frame [22 x 4]
que self.sub subj1 subj2
(dbl) (dbl) (dbl) (dbl)
1 1 1 10 18
2 2 2 12 18
3 3 3 12 18
4 4 4 20 NA
5 5 5 14 20
6 6 6 20 NA
7 7 7 21 NA
8 8 8 16 22
9 9 9 17 22
10 10 10 1 NA
.. ... ... ... ...
In response to your edit, we can use the IRanges package:
library(IRanges)
myranges = IRanges(start = intDF$start, end = intDF$stop)
data = as.data.frame(findOverlaps(myranges))
data
queryHits subjectHits
1 1 10
2 1 1
3 1 18
4 2 18
5 2 2
6 2 12
7 3 18
8 3 12
9 3 3
10 4 4
... ... ...
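The hits table has the same two-column shape as inputDF, so it can be reshaped in the same way. Below is a minimal base R sketch of that step, not a definitive implementation. It assumes a data frame with columns queryHits and subjectHits; `to_wide` and `pick` are illustrative helper names of my own, not part of IRanges or dplyr. To keep it self-contained, `hits` is typed in from the ten rows printed above rather than computed.

```r
# The first ten rows of the hits table printed above (the listing is truncated).
hits <- data.frame(
  queryHits   = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4),
  subjectHits = c(10, 1, 18, 18, 2, 12, 18, 12, 3, 4)
)

# Hypothetical helper: one row per query, with the first and second
# non-self subject index (in ascending order) as subj1 and subj2.
to_wide <- function(hits) {
  # Keep every query as a factor level, even if it only hits itself.
  que_levels <- sort(unique(hits$queryHits))
  # Drop self-hits, then collect the remaining subject indices per query.
  others <- hits[hits$queryHits != hits$subjectHits, ]
  by_que <- split(others$subjectHits,
                  factor(others$queryHits, levels = que_levels))
  # i-th smallest subject index, or NA when there are fewer than i.
  pick <- function(x, i) {
    x <- sort(x)
    if (length(x) >= i) x[i] else NA_real_
  }
  data.frame(
    que      = que_levels,
    self.sub = que_levels,
    subj1    = unname(vapply(by_que, pick, numeric(1), i = 1)),
    subj2    = unname(vapply(by_que, pick, numeric(1), i = 2)),
    row.names = NULL
  )
}

res <- to_wide(hits)
print(res)
```

Query 4 has only its self-hit among the ten printed rows, so its subj1 and subj2 come out as NA in this sketch; running `to_wide` on the full `data` from `findOverlaps` would fill them in from the remaining rows.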
Upvotes: 2