Reputation: 2827
My question is related to "Subset by group with data.table", but different.
Imagine a data set like this:
tmp <- data.table(x = 1:10, y = c(27, 70, 54, 18, 50, 44, 22, 73, 6, 5))
For each row of the data, I want to calculate a new value, z, which is the min(y) over all rows with a larger value of x. For instance, for the third row of the data, where x is 3, I want min(y) among rows with x > 3 (which would be the value 5). For our purposes, you can assume the data is already ordered by x.
At first I thought of using a function like this:
min.y <- function(val, dt) {
dt[x > val, min(y)]
}
But calling tmp[, z := min.y(x, tmp)] results in a warning message:
In min(y) : no non-missing arguments to min; returning Inf
What is the proper way to do this?
PS: Obviously, for the last row I expect to get NA as the result
Upvotes: 3
Views: 112
Reputation: 269852
Here are some solutions:
1) rollapply
Assuming tmp is sorted (sort it if not), we can use rollapply to get a compact solution like this. Note that when the width argument of rollapply is a list, its elements are regarded as vectors of offsets over which to apply min.
library(data.table)
library(zoo)
tmp[, min := rollapply(y, lapply(pmax(.N:1-1, 1), seq), min, fill = NA)]
giving:
x y min
1: 1 27 5
2: 2 70 5
3: 3 54 5
4: 4 18 5
5: 5 50 5
6: 6 44 5
7: 7 22 5
8: 8 73 5
9: 9 6 5
10: 10 5 NA
2) sqldf
Using SQL, join tmp to itself on the indicated condition and take the min over each group, like this. tmp need not be sorted.
library(data.table)
library(sqldf)
sqldf("select a.*, min(b.y) min
from tmp a left join tmp b on b.x > a.x group by a.rowid")
giving:
x y min
1 1 27 5
2 2 70 5
3 3 54 5
4 4 18 5
5 5 50 5
6 6 44 5
7 7 22 5
8 8 73 5
9 9 6 5
10 10 5 NA
Upvotes: 2
Reputation: 14360
Method 1:
Since you said we can assume the data is sorted on x, you could use a cumulative minimum starting from the end of y. We cut out the first observation so that we do a > search rather than >=:
tmp$min_y <- c(rev(cummin(rev(tmp$y[-1]))), NA)
Update: The old method effectively did a >= search rather than >. Updated to do >.
Method 2: data.table
If you want to use data.table, you could group by each row and then subset within j. The if/else is needed so that at the last row we don't take the min of no values:
tmp[, min_y := {
  curr_x <- x                    # x of the current row (each group is one row)
  tmp_subs <- tmp[x > curr_x]    # rows with a strictly larger x
  if (nrow(tmp_subs) > 0) min(tmp_subs[["y"]]) else NA_real_
}, by = 1:nrow(tmp)]
tmp
# x y min_y
# 1: 1 27 5
# 2: 2 70 5
# 3: 3 54 5
# 4: 4 18 5
# 5: 5 50 5
# 6: 6 44 5
# 7: 7 22 5
# 8: 8 73 5
# 9: 9 6 5
#10: 10 5 NA
Since 5, the smallest value, is at the end, everything will be 5. Let's make this a little more interesting:
tmp <- data.table(x = 1:10, y = c(27, 70, 54, 18, 50, 44, 22, 73, 47, 58))
Our result will be:
# x y min_y
# 1: 1 27 18
# 2: 2 70 18
# 3: 3 54 18
# 4: 4 18 22
# 5: 5 50 22
# 6: 6 44 22
# 7: 7 22 47
# 8: 8 73 47
# 9: 9 47 58
#10: 10 58 NA
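As a cross-check, here is a minimal base R sketch that applies the cumulative-minimum idea from Method 1 to the new y values, one step at a time. The names y2, tail_y, suffix_min and min_y2 are illustrative only and do not appear in the code above.

```r
# y values of the "more interesting" example (assumed sorted by x)
y2 <- c(27, 70, 54, 18, 50, 44, 22, 73, 47, 58)

tail_y     <- y2[-1]                   # drop the first value so the search is ">" and not ">="
suffix_min <- rev(cummin(rev(tail_y))) # running minimum taken from the end of the vector
min_y2     <- c(suffix_min, NA)        # the last row has no later rows, so it gets NA

min_y2
# [1] 18 18 18 22 22 22 47 47 58 NA
```

This agrees with the min_y column shown above, and it needs only one pass over y rather than one subset per row.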
Upvotes: 5
Reputation: 31452
For what it's worth, one more possible approach (not sure if it's any better or worse than the others):
tmp[, z := min(tmp$y[(.I+1):NROW(tmp)]), by = 1:NROW(tmp)]
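It may be worth spelling out why the last row still ends up as NA here. The following base R sketch (the names y, n and idx are illustrative) shows that for the last row the index sequence counts downwards and reaches one position past the end of y, which yields NA:

```r
y <- c(27, 70, 54, 18, 50, 44, 22, 73, 6, 5)
n <- length(y)

idx <- (n + 1):n   # for the last row: 11 10, because ":" counts down when from > to
y[idx]             # NA 5, since position 11 is past the end of y
min(y[idx])        # NA, because min() propagates NA unless na.rm = TRUE
```

So the NA in the last row comes from out-of-range indexing rather than from an explicit check; adding na.rm = TRUE to min would make the last row return 5 instead.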
Upvotes: 2
Reputation: 887501
One option would be a non-equi self-join:
tmp[, min_y := .SD[.SD, min(y, na.rm = TRUE), on = .(x > x),
by = .EACHI]$V1][is.infinite(min_y), min_y := NA_real_][]
# x y min_y
# 1: 1 27 5
# 2: 2 70 5
# 3: 3 54 5
# 4: 4 18 5
# 5: 5 50 5
# 6: 6 44 5
# 7: 7 22 5
# 8: 8 73 5
# 9: 9 6 5
#10: 10 5 NA
Upvotes: 3