Reputation: 1361
My question is very similar to this one: How to extract the first n rows per group?
dt
date age name val
1: 2000-01-01 3 Andrew 93.73546
2: 2000-01-01 4 Ben 101.83643
3: 2000-01-01 5 Charlie 91.64371
4: 2000-01-02 6 Adam 115.95281
5: 2000-01-02 7 Bob 103.29508
6: 2000-01-02 8 Campbell 91.79532
We have a dt
and I've added an extra column named val
. First, we want to extract the first n rows within each group.
The solutions from the link provided are:
dt[, .SD[1:2], by=date] # where 1:2 is the index needed
dt[dt[, .I[1:2], by = date]$V1] # for speed
My question is how do I apply a function to the first n rows within each group if that function depends on the subsetted information. I am trying to apply something like this:
# uses other columns for results/ is dependent on subsetted rows
# but keep it simple for replication
do_something <- function(dt){
res <- ifelse(cumsum(dt$val) > 200, 1, 0)
return(res)
}
# first 2 rows of dt by group=date
x <- dt[, .SD[1:2], by=date]
# apply do_something to first 2 rows of dt by group=date
x[, list('age'=age,'name'=name,'val'=val, 'funcVal'= do_something(.SD[1:2])),by=date]
date age name val funcVal
1: 2000-01-01 3 Andrew 93.73546 0
2: 2000-01-01 4 Ben 101.83643 1
3: 2000-01-02 6 Adam 115.95281 0
4: 2000-01-02 7 Bob 103.29508 1
Am I going about this wrong? Is there a more efficient way to do this? I cannot seem to figure out how to apply the "for speed" solution to this. Is there a way to do this without saving the subset-ed results first and applying the function to the first 2 rows by date right away?
Any help is appreciated and below is the code to produce the data above:
date <- c("2000-01-01","2000-01-01","2000-01-01",
"2000-01-02","2000-01-02","2000-01-02")
age <- c(3,4,5,6,7,8)
name <- c("Andrew","Ben","Charlie","Adam","Bob","Campbell")
val <- val <- rnorm(6,100,10)
dt <- data.table(date, age, name,val)
Upvotes: 2
Views: 344
Reputation: 66819
In case there's more than one grouping column, it might be more efficient to collapse to one:
m = dt[, .(g = .GRP, r = .I[1:2]), by = date]
dt[m$r, v := ff(.SD), by=m$g, .SDcols="val"]
This is just an extension to @eddi's approach (of keeping row numbers .I
, seen in @akrun's answer) to also keep group counter .GRP
.
Re OP's comment that they're more concerned about the function, well, borrowing from @akrun, there's ...
ff = function(x) as.integer(cumsum(x[[1]]) > 200)
Assuming all values are nonnegative, you could probably handle this in C more efficiently, since the cumulative sum can stop as soon as the threshold is reached. For the special case of two rows, that will hardly matter, though.
My impression is that this is a dummy function so there's no point going there. Many efficiency improvements that I usually think of are contingent on the function and data.
Upvotes: 5
Reputation: 887108
We can use as.integer
on the cumsum
to coerce the logical to binary. Extract the row index, specify it as i
, grouped by 'date', apply the function on the 'val' column
f1 <- function(x) as.integer(cumsum(x) > 200)
i1 <- dt[, .I[1:2], by = date]$V1
dt[i1, newcol := f1(val), date]
Upvotes: 3