Reputation: 1159
Note: I am using an example dataset from a previous post for convenience purposes.
Suppose that there are two data sets, ref
and map
. They are:
ref <- data.table(space=rep('nI',3),t1=c(100,300,500),t2=c(150,400,600),id=letters[1:3])
map <- data.table(space=rep('nI',241),t1=seq(0,1200,by=5),t2=seq(5,1205,by=5),res=rnorm(241))
they look like:
> ref
space t1 t2 id
1: nI 100 150 a
2: nI 300 400 b
3: nI 500 600 c
> map
space t1 t2 res
1: nI 0 5 -0.7082922
2: nI 5 10 1.8251041
3: nI 10 15 0.2076552
4: nI 15 20 0.8047347
5: nI 20 25 2.3388920
---
237: nI 1180 1185 1.0229284
238: nI 1185 1190 -0.3657815
239: nI 1190 1195 0.3013489
240: nI 1195 1200 1.2947271
241: nI 1200 1205 -1.5050221
Now, it has come to my attention that in the data.table package that is still under development, the function foverlaps
will fill in intervals in ref
with the corresponding rows in map
.
setkey(ref,space,t1,t2)
foverlaps(map,ref,type="within",nomatch=0L)
which gives:
space t1 t2 id i.t1 i.t2 res
1: nI 100 150 a 100 105 -0.85202726
2: nI 100 150 a 105 110 0.79748876
3: nI 100 150 a 110 115 1.49894097
4: nI 100 150 a 115 120 0.47719957
5: nI 100 150 a 120 125 -0.95767896
6: nI 100 150 a 125 130 -0.51054673
7: nI 100 150 a 130 135 -0.08478700
8: nI 100 150 a 135 140 -0.69526566
9: nI 100 150 a 140 145 2.14917623
10: nI 100 150 a 145 150 -0.05348163
11: nI 300 400 b 300 305 0.28834548
12: nI 300 400 b 305 310 0.32449616
13: nI 300 400 b 310 315 1.16107248
14: nI 300 400 b 315 320 1.08550676
15: nI 300 400 b 320 325 0.84640788
16: nI 300 400 b 325 330 -2.15485447
17: nI 300 400 b 330 335 1.59115714
18: nI 300 400 b 335 340 -0.57588128
19: nI 300 400 b 340 345 0.23957563
20: nI 300 400 b 345 350 -0.60824259
21: nI 300 400 b 350 355 -0.84828189
22: nI 300 400 b 355 360 -0.43528701
23: nI 300 400 b 360 365 -0.80026281
24: nI 300 400 b 365 370 -0.62914234
25: nI 300 400 b 370 375 -0.83485164
26: nI 300 400 b 375 380 1.46922713
27: nI 300 400 b 380 385 -0.53965310
28: nI 300 400 b 385 390 0.98728765
29: nI 300 400 b 390 395 -0.66328893
30: nI 300 400 b 395 400 -0.08182384
31: nI 500 600 c 500 505 0.72566100
32: nI 500 600 c 505 510 2.27878366
33: nI 500 600 c 510 515 0.72974139
34: nI 500 600 c 515 520 -0.35358019
35: nI 500 600 c 520 525 -1.20697646
36: nI 500 600 c 525 530 -0.01719057
37: nI 500 600 c 530 535 0.06686472
38: nI 500 600 c 535 540 -0.40866088
39: nI 500 600 c 540 545 -1.02697573
40: nI 500 600 c 545 550 2.19822065
41: nI 500 600 c 550 555 0.57075648
42: nI 500 600 c 555 560 -0.52009726
43: nI 500 600 c 560 565 -1.82999177
44: nI 500 600 c 565 570 2.53776578
45: nI 500 600 c 570 575 0.85626293
46: nI 500 600 c 575 580 -0.34245708
47: nI 500 600 c 580 585 1.21679869
48: nI 500 600 c 585 590 1.87587020
49: nI 500 600 c 590 595 -0.23325264
50: nI 500 600 c 595 600 0.18845022
space t1 t2 id i.t1 i.t2 res
to run the developement version of data.table 1.9.3, the following code will help you run it:
install.packages("devtools")
library(devtools)
dev_mode(on=T)
install_github("Rdatatable/data.table", build_vignettes=FALSE)
dev_mode(on=F)
What I am trying to do instead:
The above basically lists out all the intervals contained within the time intervals. However, I am trying to just create a new column by ref
that counts the number of rows in map
that are within the time intervals of ref
. Hence, the table I would like is:
> ref
space t1 t2 id count
1: nI 100 150 a 10
2: nI 300 400 b 20
3: nI 500 600 c 20
The count for each indicates how many rows of map
fell in between each time interval of ref
. While I understand a very basic solution is to just use a sum or count function to count, is there a solution that would create the counts without having to first create the larger filled in dataset? I say this because my real data contains over 300 million observations. Any suggestions would be greatly helpful! Thank you!
Upvotes: 2
Views: 2191
Reputation: 118779
You can use which=TRUE
argument to get the position of overlaps, and then use to get the counts by doing a simple aggregation:
ans = foverlaps(map, ref, type="within", nomatch=0L, which=TRUE)[, .N, by=yid]
# yid N
# 1: 1 10
# 2: 2 20
# 3: 3 20
And then get this back in ref
. But we should provide a more direct way to accomplish this.
Upvotes: 6
Reputation: 46856
The Biocondcutor GenomicRanges package, developed for genetic data, has the notion of a 'space' (sequence name), coordinates (integer ranges defined as closed intervals of start and end coordinates) within the space, and 'metadata' columns that are associate with each range. So
library(GenomicRanges)
ref <- with(ref, GRanges(space, IRanges(t1, t2), id=id))
map <- with(map, GRanges(space, IRanges(t1, t2), res=rnorm(241)))
GenomicRanges support many range-based operations, including
ref$count <- countOverlaps(ref, map)
to accomplish what you are interested in (countOverlaps supports different notions of overlap, so perhaps the default (any overlap between ranges) is not what you're interested in; a 'range' is inclusive of its start and end, but range can easily be shifted or narrowed; both of these might mean that the direct application of countOverlaps()
is different from what you were expecting.
To process large amounts of data, it's natural to iterate through in chunks (e.g., of 10M rows of the larger data), either by opening a connection to the data source and reading in successive rows, or by making range-based queries of a data base or whatever file format you have.
con <- open("my.csv")
while (nrow(map0 <- read.csv(con, nrows=10000000))) {
map <- with(map0, GRanges(space, IRanges(t1, t2)))
ref$count <- ref$count + countOverlaps(ref, map)
}
close(con)
Upvotes: 2
Reputation: 11597
You could do it like this:
count <- function(x, y) map[,sum(t1>=x & t2<=y)]
ref[, count:=mapply(x=t1, y=t2, count)]
ref
space t1 t2 id count
1: nI 100 150 a 10
2: nI 300 400 b 20
3: nI 500 600 c 20
Upvotes: 3