Reputation: 3108
I have two large datasets that want to match with each other
library(tidyverse)
df1 <- tibble(position=c(10,11,200,250,300))
df1
#> # A tibble: 5 × 1
#> position
#> <dbl>
#> 1 10
#> 2 11
#> 3 200
#> 4 250
#> 5 300
df2 <- tibble(start=c(1,10,200,251),
end=c(20,100,250,350),
name=c("geneA","geneB","geneC","geneD"))
df2
#> # A tibble: 4 × 3
#> start end name
#> <dbl> <dbl> <chr>
#> 1 1 20 geneA
#> 2 10 100 geneB
#> 3 200 250 geneC
#> 4 251 350 geneD
Created on 2022-03-03 by the reprex package (v2.0.1)
I have the position of the genes in the df1 and I want to find based on the range (start-end) from the df2 how many genes can be found in this position.
I want my data to look like this
position start end name
<dbl> <dbl> <dbl> <chr>
1 10 1 20 geneA
2 10 10 100 geneB
3 11 1 20 geneA
4 11 10 100 geneB
5 200 200 250 geneC
6 250 200 250 geneC
7 300 251 350 geneD
One way to solve this could be through crossing and filtering
df1 %>%
crossing(df2) %>%
filter(position >= start & position <= end)
However my dataset is way too large and can not afford crossing or expanding. Any other idea?
Upvotes: 2
Views: 322
Reputation: 269854
1) SQL engines can perform such operations without crossing. (It may be possible to speed it up even more if you add indexes.)
library(sqldf)
sqldf("select *
from df1 a
join df2 b on a.position between b.start and b.end")
2) data.table can also do some sql-like operations. (Be careful because the first variable in each comparison must be from the first data table and the second from the second. They can't be reordered so, for example, the first comparison could not be written as position <= start even though it is mathematically the same.) Again, adding indexes may improve the speed.
library(data.table)
dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)[, c("start2", "end2") := .(start, end)]
dt2[dt1, on = .(start <= position, end >= position)]
Upvotes: 2
Reputation: 76545
Here is a dplyr
way (sort of).
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df1 <- tibble(position = c(10, 11, 200, 250, 300))
df2 <- tibble(
start = c(1, 10, 200, 251),
end = c(20, 100, 250, 350),
name = c("geneA", "geneB", "geneC", "geneD")
)
vbetween <- function(data, col, data2, start, end){
f <- function(x, l, r) l <= x & x <= r
col <- enquo(col)
start <- enquo(start)
end <- enquo(end)
x <- data %>% pull(!!col)
l <- data2 %>% pull(!!start)
r <- data2 %>% pull(!!end)
yes <- lapply(x, f, l = l, r = r)
lapply(yes, \(i) data2[i, ])
}
df1 %>% vbetween(position, df2, start, end) %>% bind_rows()
#> # A tibble: 7 x 3
#> start end name
#> <dbl> <dbl> <chr>
#> 1 1 20 geneA
#> 2 10 100 geneB
#> 3 1 20 geneA
#> 4 10 100 geneB
#> 5 200 250 geneC
#> 6 200 250 geneC
#> 7 251 350 geneD
Created on 2022-03-03 by the reprex package (v2.0.1)
Upvotes: 1
Reputation: 10637
crossing
is a wrapper around expand_grid
and does additional stuff e.g. filtering. You can use it directly:
library(tidyverse)
df1 <- tibble(position = c(10, 11, 200, 250, 300))
df1
#> # A tibble: 5 × 1
#> position
#> <dbl>
#> 1 10
#> 2 11
#> 3 200
#> 4 250
#> 5 300
df2 <- tibble(
start = c(1, 10, 200, 251),
end = c(20, 100, 250, 350),
name = c("geneA", "geneB", "geneC", "geneD")
)
expand_grid(df1, df2) %>%
filter(position >= start & position <= end)
#> # A tibble: 7 × 4
#> position start end name
#> <dbl> <dbl> <dbl> <chr>
#> 1 10 1 20 geneA
#> 2 10 10 100 geneB
#> 3 11 1 20 geneA
#> 4 11 10 100 geneB
#> 5 200 200 250 geneC
#> 6 250 200 250 geneC
#> 7 300 251 350 geneD
Created on 2022-03-03 by the reprex package (v2.0.0)
Upvotes: 1