Reputation: 15708
I want to combine the functionality of .SD
with by =
in a non-equi join:
data.table - select first n rows within group
Example data:
tmp_dt1<- data.table(grp = c(1,2), time = c(0.2, 0.6, 0.4, 0.8, 0.25, 0.65))
tmp_dt2 <- data.table(grp = c(1,2), time_from = c(0.1, 0.5))
tmp_dt2 <- tmp_dt2[, time_to := time_from + 0.2]
> tmp_dt1
grp time
1: 1 0.20
2: 2 0.60
3: 1 0.40
4: 2 0.80
5: 1 0.25
6: 2 0.65
> tmp_dt2
grp time_from time_to
1: 1 0.1 0.3
2: 2 0.5 0.7
Now, my desired output is the first time in each group that lies between the ranges defined in tmp_dt2
. I can get all such times by:
> tmp_dt1[tmp_dt2, .(grp, time = x.time, time_from, time_to), on = .(grp, time >= time_from, time <= time_to)]
grp time time_from time_to
1: 1 0.20 0.1 0.3
2: 1 0.25 0.1 0.3
3: 2 0.60 0.5 0.7
4: 2 0.65 0.5 0.7
However, I am having some trouble extracting the first n rows from each grp
using by
, without chaining. As an example, when n = 1
, the desired output is:
tmp_dt1[tmp_dt2, .(grp, time = x.time, time_from, time_to),
on = .(grp, time >= time_from, time <= time_to)][, .SD[1], by = grp]
grp time time_from time_to
1: 1 0.2 0.1 0.3
2: 2 0.6 0.5 0.7
but, something like:
> tmp_dt1[tmp_dt2, .(time = x.time[1], time_from[1], time_to[1]), on = .(grp, time >= time_from, time <= time_to), by = grp]
Error in `[.data.table`(tmp_dt1, tmp_dt2, .(time = x.time[1], time_from[1], :
object 'time_from' not found
does not work.
Using, .SD
comes close, but gives me a confusing end of result in terms of the columns selected:
tmp_dt1[tmp_dt2, .SD[1], on = .(grp, time >= time_from, time <= time_to), by = grp]
grp time
1: 1 0.2
2: 2 0.6
The reason why I do not want to do it in a chain is because of memory issues. Please note, I am only interested in solving this particular problem with the data.table
package.
Upvotes: 3
Views: 417
Reputation: 118789
Have you tried
tmp_dt1[tmp_dt2, on=.(grp, time>=time_from, time<=time_to),
x.time, by=.EACHI] # or head(x.time, 2L) to get first 2 rows etc.
?
You'll need to rename the duplicate columns by yourself until that's taken care of internally, as explained here.
Upvotes: 2
Reputation: 8750
If you want to minimize the memory usage another solution may be more memory efficient than the original chaining approach even though it looks strange to store a temporary result in a variable (but it contains only two columns and only the first n rows per group) and still use chaining (but on a smaller subset of the original data):
n = 1 # parameter: first "n" rows per group
selected.rows <- tmp_dt1[tmp_dt2, .(rownum = .I[1:n]), on = .(grp, time >= time_from, time <= time_to), by = grp]
tmp_dt1[selected.rows$rownum][tmp_dt2, .(grp, time = x.time, time_from, time_to), on = .(grp, time >= time_from, time <= time_to)]
Not very elegant and maybe slower (it duplicates the join logic and requires to join twice - even though on a reduced sub set in the second case)...
The temporary result set contains the row number of each "match" in the original data table (using the .I
symbol of data.table
):
selected.rows
grp rownum
1: 1 1
2: 2 2
It would be great to compare this solution with chaining using a real big data table... (if I have more time I will profile this)
Upvotes: 1
Reputation: 887048
One option is to specify mult= first
tmp_dt1[tmp_dt2, .(grp, time = x.time, time_from, time_to), mult = "first",
on = .(grp, time >= time_from, time <= time_to)]
# grp time time_from time_to
#1: 1 0.2 0.1 0.3
#2: 2 0.6 0.5 0.7
Upvotes: 2