histelheim
histelheim

Reputation: 5088

Replace duplicates with NA after the maximum in a series has been reached

Imagine a data.frame that looks like this:

id  t1  t2  t3  t4  t5
1   5   10  11  11  11
2   6   7   12  13  16
3   1   2   2   2   2
4   3   3   4   4   4

The numbers are consistently increasing across the rows, but at some point the number series reaches its maximum, and the same number is just repeated. How can I, using R, convert this into a data.frame where the "duplicated" numbers are replaced by a NA or simply left blank? I.e., into something like this:

id  t1  t2  t3  t4  t5
1   5   10  11  NA  NA
2   6   7   12  13  16
3   1   2   NA  NA   NA
4   3   3   4   NA   NA

Upvotes: 1

Views: 58

Answers (4)

Adverse_Event
Adverse_Event

Reputation: 113

The easiest way I could see was to melt the table to a long format (using the data.table melt function), and write a custom function that takes in a number series and returns the expected output. I then just cast the table back to the original format. I'm sure there are some nifty cool solutions out there that are far better, but at least this works as expected.

library(data.table)
x <- data.table('id' = c(1,2,3,4),
                't1' = c(5,6,1,3),
                't2' = c(10, 7, 2, 3),
                't3' = c(11, 12, 2, 4),
                't4' = c(11, 13, 2, 4),
                't5' = c(11, 16, 2, 4))

x <- melt.data.table(x, id.vars = 'id')

find_max_replace_subsequent <- function(numbers){
  
  max_is <- max(numbers)
  max_found <- 0

  for(i in c(1:length(numbers))){
    current_number <- numbers[i]

    if(max_found == 1){
      numbers[i] <- NA

    }
    if(current_number == max_is & max_found == 0){
      max_found <- 1
    }
    
  }
  
  return(numbers)
}
x[, 'value' := find_max_replace_subsequent(numbers = value), by = c(by1 = 'id')]

x <- dcast.data.table(x, formula = id~variable, value.var = 'value')
x

Upvotes: 1

Anoushiravan R
Anoushiravan R

Reputation: 21908

Thank you @Martin Gal for suggesting such a clever hint. Yes we could indeed dispense with the else statement:

library(dplyr)
library(purrr)

df %>%
  pmap_df(~ {x <- c(...)[-1]
  ind <- which.max(x)
  if(ind < length(x)) {
    x[(ind + 1):length(x)] <- NA
  } 
  c(c(...)[1], x)
  })

# A tibble: 4 x 6
     id    t1    t2    t3    t4    t5
  <int> <int> <int> <int> <int> <int>
1     1     5    10    11    NA    NA
2     2     6     7    12    13    16
3     3     1     2    NA    NA    NA
4     4     3     3     4    NA    NA

Upvotes: 2

r2evans
r2evans

Reputation: 160417

dat[cbind(FALSE, t(apply(dat[,-1], 1, function(z) duplicated(z) & z >= max(z))))] <- NA
dat
#   id t1 t2 t3 t4 t5
# 1  1  5 10 11 NA NA
# 2  2  6  7 12 13 16
# 3  3  1  2 NA NA NA
# 4  4  3  3  4 NA NA

Breakdown:

  1. Since we need to work row-wise, we'll use apply(dat, 1, .).

  2. On each row, we need those that are at-or-above the max value and duplicated, ergo the anon-func

    function(z) duplicated(z) & z >= max(z)
    

    This by itself produces a transposed matrix (because of how R's apply operates), which we then transpose into a correctly-shaped logical matrix:

    t(apply(dat[,-1], 1, function(z) duplicated(z) & z >= max(z)))
    #         t1    t2    t3    t4    t5
    # [1,] FALSE FALSE FALSE  TRUE  TRUE
    # [2,] FALSE FALSE FALSE FALSE FALSE
    # [3,] FALSE FALSE  TRUE  TRUE  TRUE
    # [4,] FALSE FALSE FALSE  TRUE  TRUE
    
  3. We omitted the id column with dat[,-1], but for reassigning NA, we need to cbind(FALSE, .) so that the id column is preserved.

  4. Lastly, we reassign to just those fields by using dat[.] <- NA.

PS: the alternate functions used in other answers works just as well here:

# equivalent with this sample data
function(z) duplicated(z) & z >= max(z)
function(z) seq_along(z) > which.max(z)

The largest differences in the answers (so far) is a preference towards R dialects, whether base or dplyr+purrr.


Data

dat <- structure(list(id = 1:4, t1 = c(5L, 6L, 1L, 3L), t2 = c(10L, 7L, 2L, 3L), t3 = c(11L, 12L, NA, 4L), t4 = c(NA, 13L, NA, NA), t5 = c(NA, 16L, NA, NA)), row.names = c(NA, -4L), class = "data.frame")

Upvotes: 3

tmfmnk
tmfmnk

Reputation: 39858

One option using dplyr and purrr could be:

df %>%
    mutate(pmap_dfr(across(-id), ~ `[<-`(c(...), seq_along(c(...)) > which.max(c(...)), NA)))

  id t1 t2 t3 t4 t5
1  1  5 10 11 NA NA
2  2  6  7 12 13 16
3  3  1  2 NA NA NA
4  4  3  3  4 NA NA

Upvotes: 2

Related Questions