Michael A
Michael A

Reputation: 4613

How do I flag the last observation in a group while maintaining a specific sort order within the group?

This is related to this question. I have data like this:

   x t
1: 1 1
2: 1 2
3: 1 3
4: 2 1
5: 2 2
6: 2 3

I'd like to flag the last observation in every group (and keep the other observations), defined by x, where the "last" observation is defined by t. I tried this:

dt[order(x, t), flag_last := 1, by = "x", mult = "last"]

but that returns

   x t flag_last
1: 1 1         1
2: 1 2         1
3: 1 3         1
4: 2 1         1
5: 2 2         1
6: 2 3         1

The desired output is

   x t flag_last
1: 1 1         0
2: 1 2         0
3: 1 3         1
4: 2 1         0
5: 2 2         0
6: 2 3         1

Am I going about this the wrong way?


A couple of caveats:

Upvotes: 0

Views: 872

Answers (4)

IceCreamToucan
IceCreamToucan

Reputation: 28695

One option is to use .N and which.max to check for equality between the row index and the row index at which t is maximized

df[, flag := as.integer(1:.N == which.max(t)), x]

But benchmarking shows replace is faster on my machine for this dataset, and if you don't mind NAs instead of 0s, David Arenburg's suggested method using .I is fastest.

df <- data.table(x = rep(1:1e4, each = 1e4), t = sample(1e4*1e4))

library(microbenchmark)

microbenchmark(
replace = df[, flag_last := replace(vector(mode = "integer", length = .N), which.max(t), 1L), x],
use.N = df[, flag := as.integer(1:.N == which.max(t)), x],
use.max = df[, flag := as.integer(t==max(t)), x],
use.I = {
  df[, flag := 0L]
  df[df[, .I[which.max(t)], by = x]$V1, flag := 1L]
},
use.I.no0 = df[df[, .I[which.max(t)], by = x]$V1, flag := 1L],
times = 20)


# Unit: seconds
#       expr      min       lq     mean   median       uq      max neval cld
#    replace 1.228490 1.292348 1.442919 1.443021 1.578300 1.659990    20  b 
#      use.N 1.439939 1.522782 1.617104 1.574932 1.696046 1.923207    20   c
#    use.max 1.405741 1.436817 1.596363 1.502337 1.663895 2.743942    20   c
#      use.I 1.497599 1.547276 1.574657 1.564789 1.589066 1.686353    20  bc
#  use.I.no0 1.080715 1.115329 1.162752 1.145145 1.182280 1.383989    20 a  

Upvotes: 2

s_baldur
s_baldur

Reputation: 33488

dt[, flag_last := replace(vector(mode = "integer", length = .N), which.max(t), 1L), x]


#    x t flag_last
# 1: 1 1         0
# 2: 1 2         0
# 3: 1 3         1
# 4: 2 1         0
# 5: 2 2         0
# 6: 2 3         1

Upvotes: 2

Giulia
Giulia

Reputation: 279

This would do the trick, if you create an id variable that you can then use to merge the two datasets together:

library(dplyr)
x <- c(1,1,1,2,2,2)
t <- c(1,2,3,1,2,3)
id <- as.character(c(1,2,3,4,5,6))

data <- data.frame(x,t, id)

You create a sliced dataset with the max value of each group, and then you merge it back to the initial dataframe.

sliced <- data %>%
  group_by(x) %>%
  slice(which.max(t)) %>%
  ungroup() %>%
  select(id) %>%
  mutate(max = "max_group")

tot <- left_join(data, sliced, c("id"))

The sliced df has only two variables, so might be not too bad to work with. This is the easier way that came to my mind.

Upvotes: 1

G. Grothendieck
G. Grothendieck

Reputation: 269694

Use the built-in .I like this:

DT[, is.end := .I == last(.I), by = "x"]

Upvotes: 3

Related Questions