Scott Ritchie
Scott Ritchie

Reputation: 10543

Efficiently merging two data frames on a non-trivial criteria

Answering this question last night, I spent a good hour trying to find a solution that didn't grow a data.frame in a for loop, without any success, so I'm curious if there's a better way to go about this problem.

The general case of the problem boils down to this:

For a concrete example I will use similar data to the linked question:

genes <- data.frame(gene       = letters[1:5], 
                    chromosome = c(2,1,2,1,3),
                    start      = c(100, 100, 500, 350, 321),
                    end        = c(200, 200, 600, 400, 567))
markers <- data.frame(marker = 1:10,
                   chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                   position   = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))

And our complex matching function:

# matching criteria, applies to a single entry from each data.frame
isMatch <- function(marker, gene) {
  return(
    marker$chromosome == gene$chromosome & 
    marker$postion >= (gene$start - 10) &
    marker$postion <= (gene$end + 10)
  )
}

The output should look like an sql INNER JOIN of the two data.frames, for entries where isMatch is TRUE. I've tried to construct the two data.frames so that there can be 0 or more matches in the other data.frame.

The solution I came up with is as follows:

joined <- data.frame()
for (i in 1:nrow(genes)) {
   # This repeated subsetting returns the same results as `isMatch` applied across
   # the `markers` data.frame for each entry in `genes`.
   matches <- markers[which(markers$chromosome == genes[i, "chromosome"]),]
   matches <- matches[which(matches$pos >= (genes[i, "start"] - 10)),]
   matches <- matches[which(matches$pos <= (genes[i, "end"] + 10)),]
   # matches may now be 0 or more rows, which we want to repeat the gene for:
   if(nrow(matches) != 0) {
     joined <- rbind(joined, cbind(genes[i,], matches[,c("marker", "position")]))
   }
}

Giving the results:

   gene chromosome start end marker position
1     a          2   100 200      3       96
2     a          2   100 200      4      206
3     b          1   100 200      1      105
4     b          1   100 200      5      150
5     b          1   100 200      9      120
51    e          3   321 567      6      400

This is quite an ugly and clungy solution, but anything else I tried was met with failure:

I'm sure I will come up with a problem of this general form in the future. So what's the correct way to solve this kind of problem?

Upvotes: 6

Views: 3340

Answers (4)

user2669497
user2669497

Reputation: 157

After almost one year regarding to this problem you solved for me... now i spent some time to deal with this using another way by awk....

awk 'FNR==NR{a[NR]=$0;next}{for (i in a){split(a[i],x," ");if (x[2]==$2 && x[3]-10 <=$3 && x[4]+10 >=$3)print x[1],x[2],x[3],x[4],$0}}' gene.txt makers.txt > genesnp.txt

which produce the kind of same results:

b   1   100 200 1   1   105
a   2   100 200 3   2   96
a   2   100 200 4   2   206
b   1   100 200 5   1   150
e   3   321 567 6   3   400
b   1   100 200 9   1   120

Upvotes: 0

Blue Magister
Blue Magister

Reputation: 13363

A data table solution: a rolling join to fulfill the first inequality, followed by a vector scan to satisfy the second inequality. The join-on-first-inequality will have more rows than the final result (and therefore may run into memory issues), but it will be smaller than a straight-up merge in this answer.

require(data.table)

genes_start <- as.data.table(genes)
## create the start bound as a separate column to join to
genes_start[,`:=`(start_bound = start - 10)]
setkey(genes_start, chromosome, start_bound)

markers <- as.data.table(markers)
setkey(markers, chromosome, position)

new <- genes_start[
    ##join genes to markers
    markers, 
    ##rolling the last key column of genes_start (start_bound) forward
    ##to match the last key column of markers (position)
    roll = Inf, 
    ##inner join
    nomatch = 0
##rolling join leaves positions column from markers
##with the column name from genes_start (start_bound)
##now vector scan to fulfill the other criterion
][start_bound <= end + 10]
##change names and column order to match desired result in question
setnames(new,"start_bound","position")
setcolorder(new,c("chromosome","gene","start","end","marker","position"))
   # chromosome gene start end marker position
# 1:          1    b   100 200      1      105
# 2:          1    b   100 200      9      120
# 3:          1    b   100 200      5      150
# 4:          2    a   100 200      3       96
# 5:          2    a   100 200      4      206
# 6:          3    e   321 567      6      400

One could do a double join, but as it involves re-keying the data table before the second join, I don't think that it will be faster than the vector scan solution above.

##makes a copy of the genes object and keys it by end
genes_end <- as.data.table(genes)
genes_end[,`:=`(end_bound = end + 10, start = NULL, end = NULL)]
setkey(genes_end, chromosome, gene, end_bound)

## as before, wrapped in a similar join (but rolling backwards this time)
new_2 <- genes_end[
    setkey(
        genes_start[
        markers, 
        roll = Inf, 
        nomatch = 0
    ], chromosome, gene, start_bound), 
    roll = -Inf, 
    nomatch = 0
]
setnames(new2, "end_bound", "position")

Upvotes: 4

Scott Ritchie
Scott Ritchie

Reputation: 10543

Another answer I've come up with using the sqldf package.

sqldf("SELECT gene, genes.chromosome, start, end, marker, position 
       FROM genes JOIN markers ON genes.chromosome = markers.chromosome 
       WHERE position >= (start - 10) AND position <= (end + 10)")

Using microbenchmark it performs comparably to @alexwhan's merge and [ method.

> microbenchmark(alexwhan, sql)
Unit: nanoseconds
     expr min    lq median  uq  max neval
 alexwhan 435 462.5  468.0 485 2398   100
      sql 422 456.5  466.5 498 1262   100

I've also attempted to test both functions on some real data of the same format I have lying around (35,000 rows for genes, 2,000,000 rows for markers, with the joined output coming to 480,000 rows).

Unfortunately merge seems unable to handle this much data, falling over at joined.raw <- merge(genes, markers) with an error (which i don't get if reduce the number of rows):

Error in merge.data.frame(genes, markers) : 
  negative length vectors are not allowed

While the sqldf method runs successfully in 29 minutes.

Upvotes: 2

alexwhan
alexwhan

Reputation: 16026

I dealt with a very similar problem myself by doing the merge, and sorting out which rows satisfy the condition afterwards. I don't claim that this is a universal solution, if you're dealing with large datasets where there will be few entries that match the condition, this will likely be inefficient. But to adapt it to your data:

joined.raw <- merge(genes, markers)
joined <- joined.raw[joined.raw$position >= (joined.raw$start -10) & joined.raw$position <= (joined.raw$end + 10),]
joined
#    chromosome gene start end marker position
# 1           1    b   100 200      1      105
# 2           1    b   100 200      5      150
# 4           1    b   100 200      9      120
# 10          2    a   100 200      4      206
# 11          2    a   100 200      3       96
# 16          3    e   321 567      6      400

Upvotes: 4

Related Questions