Gaius Augustus
Gaius Augustus

Reputation: 970

R - Nested for loops and slow performance

I am trying to implement a function to get values from one table based on another. The actual dataframes have > 50,000 observations, so implementing this nested for loop is not effective. I've been trying to look through SO the past few days to find something that works, but haven't been able to. My data is in no particular order (individuals, segments, etc), so it needs to be able to work even if things are out of order.

Here are toy examples of my data to work with:

region_map <- data.frame(Start = c(721290, 1688193), End= c(1688192, 2926555))
individual <- c("Ind1","Ind2","Ind3","Ind4")
segment <- data.frame(SampleID = c("Ind1","Ind1","Ind2","Ind2","Ind3","Ind3","Ind4","Ind4","Ind4"),
                      Start = c(721290, 1688194, 721290, 1688200, 721290, 2926600, 721290, 1688193, 690),
                      End = c(1688192, 2926555,1688190, 2900000, 2926555, 3000000, 1500000, 2005000, 500000),
                      State = c(1,2,2,5,4,2,2,6,5))

And here's a simplified example of what I'm trying to do:

Generate.FullSegmentList <- function(segments, individuals, regionmap){
     FullSegments <- data.frame()
     for(region in 1:nrow(regionmap)){

          for(ind in individuals){
               # If there is not a segment within that region for that individual
               if(nrow(
                    segments[segments$start >= regionmap$Start[region] & 
                                  segments$End <= regionmap$End[region] &
                                  segments$SampleID == ind , ]
               ) == 0){
                    Temp <- data.frame(SampleID = ind, 
                                       Start = regionmap$Start[region],
                                       End = regionmap$End[region],
                                       State = 3
                    )
               }
               # If there is a segment within that region for that individual
               if(nrow(
                    segments[segments$Start >= regionmap$Start[region] & 
                                  segments$End <= regionmap$End[region] &
                                  segments$SampleID == ind , ]
               ) == 1){
                    Temp <- data.frame(SampleID = segments$SampleID, 
                                       Start = regionmap$Start[region],
                                       End = regionmap$End[region],
                                       State = segments$State[segments$Start >= regionmap$Start[region] & 
                                                                  segments$SampleID == ind ]
                    )
               }
               FullSegments <- list(FullSegments, Temp)              
          }
     }
     FullSegments
}

In words, I need to look at each region (~53,000) and assign a value (State, if none exists, give value of 3) to the region for each individual, and then create a new data.frame with every region for every individual. To do this, I'm looping through the regions and then the individuals, finding a segment (there are ~25,000 of these) that overlaps with the region and then appending it to the table.

Here is what the output from the above toy data would give:

SampleID       Start       End        State
Ind1          721290      1688192      1
Ind1          1688193     2926555      2
Ind2          721290      1688192      2
Ind2          1688193     2926555      5
Ind3          721290      1688192      4
Ind3          1688193     2926555      4
Ind4          721290      1688192      2
Ind4          1688193     2926555      6

This function as-is works exactly how I need it to, except that it will take a VERY long time to run (using system.time, I got that it would take over 3 months to run). I know there must be a better way to do this. I've tried implementing apply functions, and I saw in some other questions to use lists instead of a data.frame. I also saw that there are data.table and plyr options to simplify this. I've tried these but haven't been successful at getting it to work with the nested loop with if statements.

I would appreciate an explanation of any answers given, as this is the first time I've written anything this complex.

Questions I think are relevant:

Many other questions on nested for loops involve doing calculations that work well for doing an apply function (e.g. apply(df, 1, function(x){ mean(x) } ), but I haven't been able to adopt that to mapping values from data.frame to data.frame.

Upvotes: 1

Views: 375

Answers (3)

SymbolixAU
SymbolixAU

Reputation: 26258

I don't think you need anything 'this complex'. You can do everything you're after with a couple of joins. In this case I'll be using data.table.

You've asked for an explanation of any answer, however, for this I can't do better than point you in the direction of the data.table homepage. It will be important to understand what the set* and := commands do and how 'update by-reference' works.

Set your data to data.tables.

library(data.table)

dt_individual <- data.table(SampleID = individual)
dt_region <- data.table(region_map)
dt_segment <- data.table(segment)

Just join it all together

## Change some column names of `dt_segment` so we can identify them after the joins
setnames(dt_segment, c("Start", "End"), c("seg_Start", "seg_End"))

## create a 'key_col' to join all the individuals to the regions
dt_join <- dt_individual[, key_col := 1][ dt_region[, key_col := 1], on="key_col", allow.cartesian=T][, key_col := NULL]
#    SampleID   Start     End
# 1:     Ind1  721290 1688192
# 2:     Ind2  721290 1688192
# 3:     Ind3  721290 1688192
# 4:     Ind4  721290 1688192
# 5:     Ind1 1688193 2926555
# 6:     Ind2 1688193 2926555
# 7:     Ind3 1688193 2926555
# 8:     Ind4 1688193 2926555

Now use the foverlaps function to find the overlapping regions

setkey(dt_join, SampleID, Start, End)
setkey(dt_segment, SampleID, seg_Start, seg_End)

foverlaps(dt_join,
          dt_segment,
          type="any")

#    SampleID seg_Start seg_End State   Start     End
# 1:     Ind1    721290 1688192     1  721290 1688192
# 2:     Ind1   1688194 2926555     2 1688193 2926555
# 3:     Ind2    721290 1688190     2  721290 1688192
# 4:     Ind2   1688200 2900000     5 1688193 2926555
# 5:     Ind3    721290 2926555     4  721290 1688192
# 6:     Ind3    721290 2926555     4 1688193 2926555
# 7:     Ind4    721290 1500000     2  721290 1688192
# 8:     Ind4   1688193 2005000     6 1688193 2926555

To see all the data (i.e. both those that fall within the regions and those that don't), you can do a cartesian join, and then assign values to those within the region and those outside it as you wish

dt_join[dt_segment, on="SampleID", nomatch=0, allow.cartesian=T]

Upvotes: 0

Martin Morgan
Martin Morgan

Reputation: 46886

The Bioconductor package IRanges works on 'integer ranges' like the region and segment start and end coordinates. Install the package with

source("https://bioconductor.org/biocLite.R")
biocLite("IRanges")

Load it and create a representation of the ranges of interest

library(IRanges)
r <- with(region_map, IRanges(Start, End))
s <- with(segments, IRanges(Start, End))

The result so far is

> r
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]    721290   1688192    966903
  [2]   1688193   2926555   1238363
> s
IRanges object with 9 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]    721290   1688193    966904
  [2]   1688194   2926555   1238362
  [3]    721290   1688190    966901
  [4]   1688200   2900000   1211801
  [5]    721290   2926555   2205266
  [6]   2926600   3000000     73401
  [7]    721290   1500000    778711
  [8]   1688193   2005000    316808
  [9]       690    500000    499311

You're interested in finding the overlaps between the 'query' segments and the 'subject' region_map

olaps <- findOverlaps(s, r)

giving

> olaps
Hits object with 9 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         1           2
  [3]         2           2
  [4]         3           1
  [5]         4           2
  [6]         5           1
  [7]         5           2
  [8]         7           1
  [9]         8           2
  -------
  queryLength: 9 / subjectLength: 2

This will scale well to millions of overlaps.

You said you're interested in the state of all individuals in all regions, and from your code it looks like an individual not in a region has state 3. I created a matrix with all state 3

state <- matrix(3, nrow(region_map), length(individual),
                dimnames=list(NULL, individual))

then created a two-column index into the matrix based on the overlaps we found

idx <- matrix(c(subjectHits(olaps),
                match(segments$SampleID[queryHits(olaps)], individual)),
              ncol=2)

and used the index matrix to update the state

state[idx] <- segments$State[queryHits(olaps)]

This actually summarizes your desired result -- the state in each region x individual combination. One possible issue is when two segments of the same individual overlap a single region, and the segments have different state; only one state will be assigned.

> state
     Ind1 Ind2 Ind3 Ind4
[1,]    1    2    4    2
[2,]    2    5    4    6

Cast it as a data.frame with, e.g.,

data.frame(SampleID=colnames(state)[col(state)],
           Start=region_map[row(state), "Start"],
           End=region_map[row(state), "End"],
           State=as.vector(state))

Upvotes: 2

Chris Conlan
Chris Conlan

Reputation: 2962

You have a lot of lines in your code that read nrow(some-subset-of-your-data). You would see a quick performance increase if you switched these to sum(the-conditions). For example:

Turn:

nrow(segments[segments$start >= regionmap$Start[region] & 
                                   segments$End <= regionmap$End[region] &
                                  segments$SampleID == ind , ]) == 0

Into

sum(segments$start >= regionmap$Start[region] & 
                                   segments$End <= regionmap$End[region] &
                                  segments$SampleID == ind) == 0

This way, R doesn't store your subsetted data frame in memory every time.

In addition, store this operation as a boolean so you only need to call it once in each loop.

isEmpty <- sum(segments$start >= regionmap$Start[region] & 
                                   segments$End <= regionmap$End[region] &
                                  segments$SampleID == ind) == 0

if(isEmpty){
### do something
} else if(!isEmpty) {
### do something else
}

Upvotes: 1

Related Questions