Reputation: 4613
This is related to this question. I have data like this:
x t
1: 1 1
2: 1 2
3: 1 3
4: 2 1
5: 2 2
6: 2 3
I'd like to flag the last observation in every group (and keep the other observations), defined by x
, where the "last" observation is defined by t
. I tried this:
dt[order(x, t), flag_last := 1, by = "x", mult = "last"]
but that returns
x t flag_last
1: 1 1 1
2: 1 2 1
3: 1 3 1
4: 2 1 1
5: 2 2 1
6: 2 3 1
The desired output is
x t flag_last
1: 1 1 0
2: 1 2 0
3: 1 3 1
4: 2 1 0
5: 2 2 0
6: 2 3 1
Am I going about this the wrong way?
A couple of caveats:
The actual dataset is roughly 61 GB and there are only a couple of observations per x
group, so if possible I'd like to avoid creating another copy with the unique values or creating another copy with dplyr. If that's unavoidable, I'll make do.
Obviously this is simple data. The number of observations within each group is not necessarily the same, and the values for t
differ too, so simply picking out t == 3
will not work.
Upvotes: 0
Views: 872
Reputation: 28695
One option is to use .N
and which.max
to check for equality between the row index and the row index at which t is maximized
df[, flag := as.integer(1:.N == which.max(t)), x]
But benchmarking shows replace
is faster on my machine for this dataset, and if you don't mind NAs instead of 0s, David Arenburg's suggested method using .I
is fastest.
df <- data.table(x = rep(1:1e4, each = 1e4), t = sample(1e4*1e4))
library(microbenchmark)
microbenchmark(
replace = df[, flag_last := replace(vector(mode = "integer", length = .N), which.max(t), 1L), x],
use.N = df[, flag := as.integer(1:.N == which.max(t)), x],
use.max = df[, flag := as.integer(t==max(t)), x],
use.I = {
df[, flag := 0L]
df[df[, .I[which.max(t)], by = x]$V1, flag := 1L]
},
use.I.no0 = df[df[, .I[which.max(t)], by = x]$V1, flag := 1L],
times = 20)
# Unit: seconds
# expr min lq mean median uq max neval cld
# replace 1.228490 1.292348 1.442919 1.443021 1.578300 1.659990 20 b
# use.N 1.439939 1.522782 1.617104 1.574932 1.696046 1.923207 20 c
# use.max 1.405741 1.436817 1.596363 1.502337 1.663895 2.743942 20 c
# use.I 1.497599 1.547276 1.574657 1.564789 1.589066 1.686353 20 bc
# use.I.no0 1.080715 1.115329 1.162752 1.145145 1.182280 1.383989 20 a
Upvotes: 2
Reputation: 33488
dt[, flag_last := replace(vector(mode = "integer", length = .N), which.max(t), 1L), x]
# x t flag_last
# 1: 1 1 0
# 2: 1 2 0
# 3: 1 3 1
# 4: 2 1 0
# 5: 2 2 0
# 6: 2 3 1
Upvotes: 2
Reputation: 279
This would do the trick, if you create an id variable that you can then use to merge the two datasets together:
library(dplyr)
x <- c(1,1,1,2,2,2)
t <- c(1,2,3,1,2,3)
id <- as.character(c(1,2,3,4,5,6))
data <- data.frame(x,t, id)
You create a sliced dataset with the max value of each group, and then you merge it back to the initial dataframe.
sliced <- data %>%
group_by(x) %>%
slice(which.max(t)) %>%
ungroup() %>%
select(id) %>%
mutate(max = "max_group")
tot <- left_join(data, sliced, c("id"))
The sliced df has only two variables, so might be not too bad to work with. This is the easier way that came to my mind.
Upvotes: 1
Reputation: 269694
Use the built-in .I
like this:
DT[, is.end := .I == last(.I), by = "x"]
Upvotes: 3