Gaius Augustus

Reputation: 970

R - Compare rows of data.frame (to combine based on conditions)

I'm starting with a data.frame of genome ranges (a chromosome plus start and end positions). I'm trying to combine rows that are 1) adjacently positioned and 2) share a value in two other columns (SampleID and State). Note: I'd like an efficient method, since my real data is > 10 million rows (data.table if possible).

Toy data:

DF <- data.frame(SampleID = c(1,1,1,1,1,2,2),
                 Chr = c(1,1,1,1,2,1,1),
                 Start = c(1, 101, 201, 401, 500, 1, 101),
                 End = c(100, 200, 300, 499, 599, 100, 200),
                 State = c(3,3,2,3,3,2,2)
                 )
DF
  SampleID Chr Start End State
1        1   1     1 100     3
2        1   1   101 200     3
3        1   1   201 300     2
4        1   1   401 499     3
5        1   2   500 599     3
6        2   1     1 100     2
7        2   1   101 200     2

Lines 1 & 2 can be combined because they are adjacent (1-100 & 101-200) and share a SampleID (1) and State (3).
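In code, the merge condition for two consecutive rows looks like this (a sketch using just the first two rows of the toy data above):

```r
# First two rows of the toy data, as plain lists for illustration
r1 <- list(SampleID = 1, Chr = 1, Start = 1,   End = 100, State = 3)
r2 <- list(SampleID = 1, Chr = 1, Start = 101, End = 200, State = 3)

# Rows merge when metadata matches and the ranges are adjacent
mergeable <- r1$SampleID == r2$SampleID &&
  r1$Chr == r2$Chr &&
  r1$End + 1 == r2$Start &&
  r1$State == r2$State
mergeable
# TRUE
```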

The following cannot be combined:

- Lines 2 & 3: States differ (3 vs 2).
- Lines 3 & 4: not adjacent (300 to 401 leaves a gap).
- Lines 4 & 5: different Chr.
- Lines 5 & 6: different SampleID.

When we apply all these rules, we have this final table.

FinalDF <- data.frame(SampleID = c(1,1,1,1,2),
                      Chr = c(1,1,1,2,1),
                      Start = c(1,201,401,500,1),
                      End = c(200,300,499,599,200),
                      State = c(3,2,3,3,2))
FinalDF
  SampleID Chr Start End State
1        1   1     1 200     3
2        1   1   201 300     2
3        1   1   401 499     3
4        1   2   500 599     3
5        2   1     1 200     2

So far, I've tried using the reduce function from the GenomicRanges package, but it merges ranges regardless of the SampleID and State columns, so it doesn't give what I want.

INCORRECT OUTPUT

DF2 <- as(DF, "GRanges")
reduce(DF2)
GRanges object with 3 ranges and 0 metadata columns:
      seqnames     ranges strand
         <Rle>  <IRanges>  <Rle>
  [1]        1 [  1, 300]      *
  [2]        1 [401, 499]      *
  [3]        2 [500, 599]      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

I was trying to do something with data.table, since my data.frames are 10 million rows long or more, but haven't been able to figure it out.

The following question is along the same lines (maybe a bit more complex), but has no solution. R- collapse rows based on contents of two columns

Upvotes: 1

Views: 2702

Answers (3)

eddi
eddi

Reputation: 49448

library(data.table)

dt = as.data.table(DF) # or convert in place using setDT

dt[, .(Start = min(Start), End = max(End), State = State[1])
   , by = .(SampleID, Chr, rleid(State),
            cumsum(c(FALSE, head(End + 1, -1) < tail(Start, -1))))]
#   SampleID Chr rleid cumsum Start End State
#1:        1   1     1      0     1 200     3
#2:        1   1     2      0   201 300     2
#3:        1   1     3      1   401 499     3
#4:        1   2     3      1   500 599     3
#5:        2   1     4      1     1 200     2

Upvotes: 4

Edward.LennonJr
Edward.LennonJr

Reputation: 21

# This code isn't especially robust, but it appears to get the job done

DF <- data.frame(SampleID = c(1,1,1,1,1,2,2),
                 Chr = c(1,1,1,1,2,1,1),
                 Start = c(1, 101, 201, 401, 500, 1, 101),
                 End = c(100, 200, 300, 499, 599, 100, 200),
                 State = c(3,3,2,3,3,2,2)
)

test_and_combine <- function(r1, r2) {
  if (r1[, 1] == r2[, 1] &       # check if "SampleID" columns match
      r1[, 2] == r2[, 2] &       # check if "Chr" columns match
      (r1[, 4] + 1) == r2[, 3] & # test if End and Start are adjacent
      r1[, 5] == r2[, 5]) {      # check if "State" columns match
    # merge rows if true
    DF_comb <- r1
    DF_comb[1, 4] <- r2[, 4]
  }
  else {
    DF_comb <- NA
  }
  return(DF_comb)
}

# This section could be rewritten to use Reduce()
DF_comb_final <- DF[1, ]
for (i in 2:nrow(DF)) { # loop through every remaining row of the data.frame
  last <- nrow(DF_comb_final)
  DF_temp <- test_and_combine(DF_comb_final[last, ], DF[i, ]) # compare against the last kept row
  if (!any(is.na(DF_temp))) {
    DF_comb_final[last, ] <- DF_temp # extend the previous run
  } else {
    DF_comb_final <- rbind(DF_comb_final, DF[i, ]) # start a new run
  }
}
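As the comment notes, the loop can also be written with Reduce(), folding each row into an accumulator and carrying the last (possibly merged) row forward. A self-contained sketch, with the pairwise test inlined as a hypothetical can_merge() helper:

```r
# Toy data from the question
DF <- data.frame(SampleID = c(1,1,1,1,1,2,2),
                 Chr = c(1,1,1,1,2,1,1),
                 Start = c(1, 101, 201, 401, 500, 1, 101),
                 End = c(100, 200, 300, 499, 599, 100, 200),
                 State = c(3,3,2,3,3,2,2))

# TRUE when two one-row data.frames share metadata and are adjacent
can_merge <- function(r1, r2) {
  r1$SampleID == r2$SampleID && r1$Chr == r2$Chr &&
    (r1$End + 1) == r2$Start && r1$State == r2$State
}

# Fold the rows: either extend the last kept row or append a new one
DF_comb_final <- Reduce(function(acc, i) {
  last <- nrow(acc)
  if (can_merge(acc[last, ], DF[i, ])) {
    acc[last, "End"] <- DF[i, "End"] # extend the previous run
    acc
  } else {
    rbind(acc, DF[i, ])              # start a new run
  }
}, seq_len(nrow(DF))[-1], init = DF[1, ])

DF_comb_final
```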

Upvotes: 1

sebkopf
sebkopf

Reputation: 2375

If I interpret correctly what you want to do, I suggest the following: use dplyr to group by the metadata you want to keep separate, and then use GenomicRanges to figure out the ranges within each group. (If you run into performance issues, you may want to steer clear of the data.frame conversion that GenomicRanges requires and implement the range reduction by hand to take advantage of the performance of dplyr on data.tables.) Here's an example of how this would work (making use of the pipe %>% to make it easier to see what's going on):

DF <- data.frame(SampleID = c(1,1,1,1,1,2,2),
                 Chr = c(1,1,1,1,2,1,1),
                 Start = c(1, 101, 201, 401, 500, 1, 101),
                 End = c(100, 200, 300, 499, 599, 100, 200),
                 State = c(3,3,2,3,3,2,2)
)

library(dplyr)
library(GenomicRanges) # needed for the as(., "GRanges") coercion below
# take your data frame
DF %>% 
  # group it by the subsets
  group_by(SampleID, Chr, State) %>% 
  # operate on each group
  do(
    # turn subset into a GRanges object
    as(as.data.frame(.), "GRanges") %>%
      # reduce the ranges
      GenomicRanges::reduce() %>% 
      # turn back into data frame for dplyr to stitch together
      as.data.frame() %>% 
      # get the information you want
      select(start, end, width)
  ) %>% 
  # ungroup for future operations
  ungroup() %>% 
  # sort by what makes most sense for your set
  arrange(SampleID, Chr, start)

Output:

Source: local data frame [5 x 6]

  SampleID   Chr State start   end width
     (dbl) (dbl) (dbl) (int) (int) (int)
1        1     1     3     1   200   200
2        1     1     2   201   300   100
3        1     1     3   401   499    99
4        1     2     3   500   599   100
5        2     1     2     1   200   200

Upvotes: 2
