Reputation: 970
I am trying to implement a function to get values from one table based on another. The actual dataframes have > 50,000 observations, so implementing this nested for loop is not effective. I've been trying to look through SO the past few days to find something that works, but haven't been able to. My data is in no particular order (individuals, segments, etc), so it needs to be able to work even if things are out of order.
Here are toy examples of my data to work with:
region_map <- data.frame(Start = c(721290, 1688193), End= c(1688192, 2926555))
individual <- c("Ind1","Ind2","Ind3","Ind4")
segment <- data.frame(SampleID = c("Ind1","Ind1","Ind2","Ind2","Ind3","Ind3","Ind4","Ind4","Ind4"),
Start = c(721290, 1688194, 721290, 1688200, 721290, 2926600, 721290, 1688193, 690),
End = c(1688192, 2926555,1688190, 2900000, 2926555, 3000000, 1500000, 2005000, 500000),
State = c(1,2,2,5,4,2,2,6,5))
And here's a simplified example of what I'm trying to do:
Generate.FullSegmentList <- function(segments, individuals, regionmap){
FullSegments <- data.frame()
for(region in 1:nrow(regionmap)){
for(ind in individuals){
# If there is not a segment within that region for that individual
if(nrow(
segments[segments$start >= regionmap$Start[region] &
segments$End <= regionmap$End[region] &
segments$SampleID == ind , ]
) == 0){
Temp <- data.frame(SampleID = ind,
Start = regionmap$Start[region],
End = regionmap$End[region],
State = 3
)
}
# If there is a segment within that region for that individual
if(nrow(
segments[segments$Start >= regionmap$Start[region] &
segments$End <= regionmap$End[region] &
segments$SampleID == ind , ]
) == 1){
Temp <- data.frame(SampleID = segments$SampleID,
Start = regionmap$Start[region],
End = regionmap$End[region],
State = segments$State[segments$Start >= regionmap$Start[region] &
segments$SampleID == ind ]
)
}
FullSegments <- list(FullSegments, Temp)
}
}
FullSegments
}
In words, I need to look at each region (~53,000) and assign a value (State
, if none exists, give value of 3) to the region for each individual
, and then create a new data.frame with every region for every individual. To do this, I'm looping through the regions and then the individuals, finding a segment
(there are ~25,000 of these) that overlaps with the region and then appending it to the table.
Here is what the output from the above toy data would give:
SampleID Start End State
Ind1 721290 1688192 1
Ind1 1688193 2926555 2
Ind2 721290 1688192 2
Ind2 1688193 2926555 5
Ind3 721290 1688192 4
Ind3 1688193 2926555 4
Ind4 721290 1688192 2
Ind4 1688193 2926555 6
This function as-is works exactly how I need it to, except that it will take a VERY long time to run (using system.time, I got that it would take over 3 months to run). I know there must be a better way to do this. I've tried implementing apply functions, and I saw in some other questions to use lists instead of a data.frame. I also saw that there are data.table and plyr options to simplify this. I've tried these but haven't been successful at getting it to work with the nested loop with if statements.
I would appreciate an explanation of any answers given, as this is the first time I've written anything this complex.
Questions I think are relevant:
Many other questions on nested for loops involve doing calculations that work well for doing an apply function (e.g. apply(df, 1, function(x){ mean(x) }
), but I haven't been able to adopt that to mapping values from data.frame to data.frame.
Upvotes: 1
Views: 375
Reputation: 26258
I don't think you need anything 'this complex'. You can do everything you're after with a couple of joins. In this case I'll be using data.table
.
You've asked for an explanation of any answer, however, for this I can't do better than point you in the direction of the data.table homepage. It will be important to understand what the set*
and :=
commands do and how 'update by-reference' works.
Set your data to data.table
s.
library(data.table)
dt_individual <- data.table(SampleID = individual)
dt_region <- data.table(region_map)
dt_segment <- data.table(segment)
Just join it all together
## Change some column names of `dt_segment` so we can identify them after the joins
setnames(dt_segment, c("Start", "End"), c("seg_Start", "seg_End"))
## create a 'key_col' to join all the individuals to the regions
dt_join <- dt_individual[, key_col := 1][ dt_region[, key_col := 1], on="key_col", allow.cartesian=T][, key_col := NULL]
# SampleID Start End
# 1: Ind1 721290 1688192
# 2: Ind2 721290 1688192
# 3: Ind3 721290 1688192
# 4: Ind4 721290 1688192
# 5: Ind1 1688193 2926555
# 6: Ind2 1688193 2926555
# 7: Ind3 1688193 2926555
# 8: Ind4 1688193 2926555
Now use the foverlaps
function to find the overlapping regions
setkey(dt_join, SampleID, Start, End)
setkey(dt_segment, SampleID, seg_Start, seg_End)
foverlaps(dt_join,
dt_segment,
type="any")
# SampleID seg_Start seg_End State Start End
# 1: Ind1 721290 1688192 1 721290 1688192
# 2: Ind1 1688194 2926555 2 1688193 2926555
# 3: Ind2 721290 1688190 2 721290 1688192
# 4: Ind2 1688200 2900000 5 1688193 2926555
# 5: Ind3 721290 2926555 4 721290 1688192
# 6: Ind3 721290 2926555 4 1688193 2926555
# 7: Ind4 721290 1500000 2 721290 1688192
# 8: Ind4 1688193 2005000 6 1688193 2926555
To see all the data (i.e. both those that fall within the regions and those that don't), you can do a cartesian
join, and then assign values to those within the region and those outside it as you wish
dt_join[dt_segment, on="SampleID", nomatch=0, allow.cartesian=T]
Upvotes: 0
Reputation: 46886
The Bioconductor package IRanges works on 'integer ranges' like the region and segment start and end coordinates. Install the package with
source("https://bioconductor.org/biocLite.R")
biocLite("IRanges")
Load it and create a representation of the ranges of interest
library(IRanges)
r <- with(region_map, IRanges(Start, End))
s <- with(segments, IRanges(Start, End))
The result so far is
> r
IRanges object with 2 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 721290 1688192 966903
[2] 1688193 2926555 1238363
> s
IRanges object with 9 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 721290 1688193 966904
[2] 1688194 2926555 1238362
[3] 721290 1688190 966901
[4] 1688200 2900000 1211801
[5] 721290 2926555 2205266
[6] 2926600 3000000 73401
[7] 721290 1500000 778711
[8] 1688193 2005000 316808
[9] 690 500000 499311
You're interested in finding the overlaps between the 'query' segments and the 'subject' region_map
olaps <- findOverlaps(s, r)
giving
> olaps
Hits object with 9 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 1
[2] 1 2
[3] 2 2
[4] 3 1
[5] 4 2
[6] 5 1
[7] 5 2
[8] 7 1
[9] 8 2
-------
queryLength: 9 / subjectLength: 2
This will scale well to millions of overlaps.
You said you're interested in the state of all individuals in all regions, and from your code it looks like an individual not in a region has state 3. I created a matrix with all state 3
state <- matrix(3, nrow(region_map), length(individual),
dimnames=list(NULL, individual))
then created a two-column index into the matrix based on the overlaps we found
idx <- matrix(c(subjectHits(olaps),
match(segments$SampleID[queryHits(olaps)], individual)),
ncol=2)
and used the index matrix to update the state
state[idx] <- segments$State[queryHits(olaps)]
This actually summarizes your desired result -- the state in each region x individual combination. One possible issue is when two segments of the same individual overlap a single region, and the segments have different state; only one state will be assigned.
> state
Ind1 Ind2 Ind3 Ind4
[1,] 1 2 4 2
[2,] 2 5 4 6
Cast it as a data.frame with, e.g.,
data.frame(SampleID=colnames(state)[col(state)],
Start=region_map[row(state), "Start"],
End=region_map[row(state), "End"],
State=as.vector(state))
Upvotes: 2
Reputation: 2962
You have a lot of lines in your code that read nrow(some-subset-of-your-data)
. You would see a quick performance increase if you switched these to sum(the-conditions)
. For example:
Turn:
nrow(segments[segments$start >= regionmap$Start[region] &
segments$End <= regionmap$End[region] &
segments$SampleID == ind , ]) == 0
Into
sum(segments$start >= regionmap$Start[region] &
segments$End <= regionmap$End[region] &
segments$SampleID == ind) == 0
This way, R doesn't store your subsetted data frame in memory every time.
In addition, store this operation as a boolean so you only need to call it once in each loop.
isEmpty <- sum(segments$start >= regionmap$Start[region] &
segments$End <= regionmap$End[region] &
segments$SampleID == ind) == 0
if(isEmpty){
### do something
} else if(!isEmpty) {
### do something else
}
Upvotes: 1