GBPU
GBPU

Reputation: 664

Set last N values of dataframe to NA in R

I am trying to group my dataframe and set the last N values of a column in each group to NA. I can do it for N = 1 like so:

df %>% group_by(ID) %>% mutate(target = c(target[-n()], NA))

But am struggling to get it to work for any N

This is my current attempt:

df %>% group_by(ID) %>% mutate(target = c(target[1:(abs(n()-1))], NA))

But this seems to fail for groups of size 1

I also tried:

df %>% group_by(ID) %>% mutate(target = ifelse(n()==1, target, c(target[1:(abs(n()-1))], NA)))

But the else clause never takes effect.

Any advice would be appreciated, thanks.

Upvotes: 5

Views: 608

Answers (3)

Greg
Greg

Reputation: 3326

You can use case_when() for a vectorized solution in dplyr.

library(dplyr)

df %>%
  group_by(ID) %>%
  mutate(target = case_when(row_number() <= n() - N ~ target))

My thanks to @akrun for pointing out that case_when() defaults to an NA of the proper type, so case_when() automatically fills the last N with NAs.

Update

The solution by @akrun is more performant: when benchmarked at times = 50 repetitions apiece

library(microbenchmark)

big_df <- tibble(ID = rep(letters, 100000)) %>%
  mutate(target = row_number()) %>%
  group_by(ID)

microbenchmark(
  times = 50,
  Greg = {
    big_df %>%
      mutate(target = case_when(row_number() <= n() - N ~ target))
  },
  akrun = {
    big_df %>%
      mutate(target = replace(target, tail(row_number(), N), NA))
  }
)

it is about 35% faster than mine at scale (2600000 rows):

Unit: milliseconds
  expr     min      lq      mean   median       uq      max neval
  Greg 82.6337 90.9669 128.93278 96.35760 213.3593 258.8570    50
 akrun 52.4519 55.8314  63.40997 61.43945  64.1082 196.4069    50

Upvotes: 5

TarJae
TarJae

Reputation: 78917

Here is an alternative suggestion:

We could define the top N after using arrange in descending order (with -x), apply our ifelse statement and rearrange:

library(dplyr)

N = 2

df %>% 
  group_by(id) %>% 
  arrange(-x, .by_group = TRUE) %>% 
  mutate(x = ifelse(row_number()== 1:N, NA, x)) %>% 
  arrange(x, .by_group = TRUE)
df <- structure(list(x = c(2, 4, 6, 1, 2, 5, 6, 7, 3, 4, 5, 6), id = c(1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -12L))

Upvotes: 2

akrun
akrun

Reputation: 887048

We could use

library(dplyr)
df %>% 
   group_by(ID) %>%
   mutate(target = replace(target, tail(row_number(), N), NA))

Upvotes: 5

Related Questions