This is related to this question . I have data like this: x t 1: 1 1 2: 1 2 3: 1 3 4: 2 1 5: 2 2 6: 2 3 I'd like to flag the last observation in every group (and keep the other observations), defined by x , where the "last" observation is defined by t . I tried this: dt[order(x, t), flag_last := 1, by = "x", mult = "last"] but that returns x t flag_last 1: 1 1 1 2: 1 2 1 3: 1 3 1 4: 2 1 1 5: 2 2 1 6: 2 3 1 The desired output is x t flag_last 1: 1 1 0 2: 1 2 0 3: 1 3 1 4: 2 1 0 5: 2 2 0 6: 2 3 1 Am I going about this the wrong way? A couple of caveats: The actual dataset is roughly 61 GB and there are only a couple of observations per x group, so if possible I'd like to avoid creating another copy with the unique values or creating another copy with dplyr. If that's unavoidable, I'll make do. Obviously this is simple data. The number of observations within each group is not necessarily the same , and the values for t differ too, so simply picking out t == 3 will not work.

Reputation: 4613

How do I flag the last observation in a group while maintaining a specific sort order within the group?

This is related to this question. I have data like this:

I'd like to flag the last observation in every group (and keep the other observations), defined by x, where the "last" observation is defined by t. I tried this:

dt[order(x, t), flag_last := 1, by = "x", mult = "last"]

but that returns

   x t flag_last
1: 1 1         1
2: 1 2         1
3: 1 3         1
4: 2 1         1
5: 2 2         1
6: 2 3         1

The desired output is

   x t flag_last
1: 1 1         0
2: 1 2         0
3: 1 3         1
4: 2 1         0
5: 2 2         0
6: 2 3         1

Am I going about this the wrong way?

A couple of caveats:

The actual dataset is roughly 61 GB and there are only a couple of observations per x group, so if possible I'd like to avoid creating another copy with the unique values or creating another copy with dplyr. If that's unavoidable, I'll make do.
Obviously this is simple data. The number of observations within each group is not necessarily the same, and the values for t differ too, so simply picking out t == 3 will not work.

Upvotes: 0

Answers (4)

IceCreamToucan

Reputation: 28695

One option is to use .N and which.max to check for equality between the row index and the row index at which t is maximized

df[, flag := as.integer(1:.N == which.max(t)), x]

But benchmarking shows replace is faster on my machine for this dataset, and if you don't mind NAs instead of 0s, David Arenburg's suggested method using .I is fastest.

df <- data.table(x = rep(1:1e4, each = 1e4), t = sample(1e4*1e4))

library(microbenchmark)

microbenchmark(
replace = df[, flag_last := replace(vector(mode = "integer", length = .N), which.max(t), 1L), x],
use.N = df[, flag := as.integer(1:.N == which.max(t)), x],
use.max = df[, flag := as.integer(t==max(t)), x],
use.I = {
  df[, flag := 0L]
  df[df[, .I[which.max(t)], by = x]$V1, flag := 1L]
},
use.I.no0 = df[df[, .I[which.max(t)], by = x]$V1, flag := 1L],
times = 20)


# Unit: seconds
#       expr      min       lq     mean   median       uq      max neval cld
#    replace 1.228490 1.292348 1.442919 1.443021 1.578300 1.659990    20  b 
#      use.N 1.439939 1.522782 1.617104 1.574932 1.696046 1.923207    20   c
#    use.max 1.405741 1.436817 1.596363 1.502337 1.663895 2.743942    20   c
#      use.I 1.497599 1.547276 1.574657 1.564789 1.589066 1.686353    20  bc
#  use.I.no0 1.080715 1.115329 1.162752 1.145145 1.182280 1.383989    20 a

Upvotes: 2

s_baldur

Reputation: 33488

dt[, flag_last := replace(vector(mode = "integer", length = .N), which.max(t), 1L), x]


#    x t flag_last
# 1: 1 1         0
# 2: 1 2         0
# 3: 1 3         1
# 4: 2 1         0
# 5: 2 2         0
# 6: 2 3         1

Upvotes: 2

Giulia

Reputation: 279

This would do the trick, if you create an id variable that you can then use to merge the two datasets together:

library(dplyr)
x <- c(1,1,1,2,2,2)
t <- c(1,2,3,1,2,3)
id <- as.character(c(1,2,3,4,5,6))

data <- data.frame(x,t, id)

You create a sliced dataset with the max value of each group, and then you merge it back to the initial dataframe.

sliced <- data %>%
  group_by(x) %>%
  slice(which.max(t)) %>%
  ungroup() %>%
  select(id) %>%
  mutate(max = "max_group")

tot <- left_join(data, sliced, c("id"))

The sliced df has only two variables, so might be not too bad to work with. This is the easier way that came to my mind.

Upvotes: 1

G. Grothendieck

Reputation: 269694

Use the built-in .I like this:

DT[, is.end := .I == last(.I), by = "x"]

Upvotes: 3

How do I flag the last observation in a group while maintaining a specific sort order within the group?

Answers (4)

Related Questions