Reputation: 10543
Answering this question last night, I spent a good hour trying to find a solution that didn't grow a data.frame
in a for loop, without any success, so I'm curious if there's a better way to go about this problem.
The general case of the problem boils down to this:
data.frames
data.frame
can have 0 or more matching entries in the other.data.frame
sFor a concrete example I will use similar data to the linked question:
genes <- data.frame(gene = letters[1:5],
chromosome = c(2,1,2,1,3),
start = c(100, 100, 500, 350, 321),
end = c(200, 200, 600, 400, 567))
markers <- data.frame(marker = 1:10,
chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
position = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))
And our complex matching function:
# matching criteria, applies to a single entry from each data.frame
isMatch <- function(marker, gene) {
return(
marker$chromosome == gene$chromosome &
marker$postion >= (gene$start - 10) &
marker$postion <= (gene$end + 10)
)
}
The output should look like an sql
INNER JOIN
of the two data.frames, for entries where isMatch
is TRUE
.
I've tried to construct the two data.frames
so that there can be 0 or more matches in the other data.frame
.
The solution I came up with is as follows:
joined <- data.frame()
for (i in 1:nrow(genes)) {
# This repeated subsetting returns the same results as `isMatch` applied across
# the `markers` data.frame for each entry in `genes`.
matches <- markers[which(markers$chromosome == genes[i, "chromosome"]),]
matches <- matches[which(matches$pos >= (genes[i, "start"] - 10)),]
matches <- matches[which(matches$pos <= (genes[i, "end"] + 10)),]
# matches may now be 0 or more rows, which we want to repeat the gene for:
if(nrow(matches) != 0) {
joined <- rbind(joined, cbind(genes[i,], matches[,c("marker", "position")]))
}
}
Giving the results:
gene chromosome start end marker position
1 a 2 100 200 3 96
2 a 2 100 200 4 206
3 b 1 100 200 1 105
4 b 1 100 200 5 150
5 b 1 100 200 9 120
51 e 3 321 567 6 400
This is quite an ugly and clungy solution, but anything else I tried was met with failure:
apply
, gave me a list
where each element was a matrix,
with no way to rbind
them.joined
first, because I don't
know how many rows I will need in the end.I'm sure I will come up with a problem of this general form in the future. So what's the correct way to solve this kind of problem?
Upvotes: 6
Views: 3340
Reputation: 157
After almost one year regarding to this problem you solved for me... now i spent some time to deal with this using another way by awk....
awk 'FNR==NR{a[NR]=$0;next}{for (i in a){split(a[i],x," ");if (x[2]==$2 && x[3]-10 <=$3 && x[4]+10 >=$3)print x[1],x[2],x[3],x[4],$0}}' gene.txt makers.txt > genesnp.txt
which produce the kind of same results:
b 1 100 200 1 1 105
a 2 100 200 3 2 96
a 2 100 200 4 2 206
b 1 100 200 5 1 150
e 3 321 567 6 3 400
b 1 100 200 9 1 120
Upvotes: 0
Reputation: 13363
A data table solution: a rolling join to fulfill the first inequality, followed by a vector scan to satisfy the second inequality. The join-on-first-inequality will have more rows than the final result (and therefore may run into memory issues), but it will be smaller than a straight-up merge in this answer.
require(data.table)
genes_start <- as.data.table(genes)
## create the start bound as a separate column to join to
genes_start[,`:=`(start_bound = start - 10)]
setkey(genes_start, chromosome, start_bound)
markers <- as.data.table(markers)
setkey(markers, chromosome, position)
new <- genes_start[
##join genes to markers
markers,
##rolling the last key column of genes_start (start_bound) forward
##to match the last key column of markers (position)
roll = Inf,
##inner join
nomatch = 0
##rolling join leaves positions column from markers
##with the column name from genes_start (start_bound)
##now vector scan to fulfill the other criterion
][start_bound <= end + 10]
##change names and column order to match desired result in question
setnames(new,"start_bound","position")
setcolorder(new,c("chromosome","gene","start","end","marker","position"))
# chromosome gene start end marker position
# 1: 1 b 100 200 1 105
# 2: 1 b 100 200 9 120
# 3: 1 b 100 200 5 150
# 4: 2 a 100 200 3 96
# 5: 2 a 100 200 4 206
# 6: 3 e 321 567 6 400
One could do a double join, but as it involves re-keying the data table before the second join, I don't think that it will be faster than the vector scan solution above.
##makes a copy of the genes object and keys it by end
genes_end <- as.data.table(genes)
genes_end[,`:=`(end_bound = end + 10, start = NULL, end = NULL)]
setkey(genes_end, chromosome, gene, end_bound)
## as before, wrapped in a similar join (but rolling backwards this time)
new_2 <- genes_end[
setkey(
genes_start[
markers,
roll = Inf,
nomatch = 0
], chromosome, gene, start_bound),
roll = -Inf,
nomatch = 0
]
setnames(new2, "end_bound", "position")
Upvotes: 4
Reputation: 10543
Another answer I've come up with using the sqldf
package.
sqldf("SELECT gene, genes.chromosome, start, end, marker, position
FROM genes JOIN markers ON genes.chromosome = markers.chromosome
WHERE position >= (start - 10) AND position <= (end + 10)")
Using microbenchmark
it performs comparably to @alexwhan's merge
and [
method.
> microbenchmark(alexwhan, sql)
Unit: nanoseconds
expr min lq median uq max neval
alexwhan 435 462.5 468.0 485 2398 100
sql 422 456.5 466.5 498 1262 100
I've also attempted to test both functions on some real data of the same format I have lying around (35,000 rows for genes
, 2,000,000 rows for markers
, with the joined
output coming to 480,000 rows).
Unfortunately merge
seems unable to handle this much data, falling over at joined.raw <- merge(genes, markers)
with an error (which i don't get if reduce the number of rows):
Error in merge.data.frame(genes, markers) :
negative length vectors are not allowed
While the sqldf
method runs successfully in 29 minutes.
Upvotes: 2
Reputation: 16026
I dealt with a very similar problem myself by doing the merge, and sorting out which rows satisfy the condition afterwards. I don't claim that this is a universal solution, if you're dealing with large datasets where there will be few entries that match the condition, this will likely be inefficient. But to adapt it to your data:
joined.raw <- merge(genes, markers)
joined <- joined.raw[joined.raw$position >= (joined.raw$start -10) & joined.raw$position <= (joined.raw$end + 10),]
joined
# chromosome gene start end marker position
# 1 1 b 100 200 1 105
# 2 1 b 100 200 5 150
# 4 1 b 100 200 9 120
# 10 2 a 100 200 4 206
# 11 2 a 100 200 3 96
# 16 3 e 321 567 6 400
Upvotes: 4