Reputation: 5088
Imagine a data.frame
that looks like this:
id t1 t2 t3 t4 t5
1 5 10 11 11 11
2 6 7 12 13 16
3 1 2 2 2 2
4 3 3 4 4 4
The numbers are consistently increasing across the rows, but at some point the number series reaches its maximum, and the same number is just repeated. How can I, using R
, convert this into a data.frame
where the "duplicated" numbers are replaced by a NA
or simply left blank? I.e., into something like this:
id t1 t2 t3 t4 t5
1 5 10 11 NA NA
2 6 7 12 13 16
3 1 2 NA NA NA
4 3 3 4 NA NA
Upvotes: 1
Views: 58
Reputation: 113
The easiest way I could see was to melt the table to a long format (using the data.table melt function), and write a custom function that takes in a number series and returns the expected output. I then just cast the table back to the original format. I'm sure there are some nifty cool solutions out there that are far better, but at least this works as expected.
library(data.table)
x <- data.table('id' = c(1,2,3,4),
't1' = c(5,6,1,3),
't2' = c(10, 7, 2, 3),
't3' = c(11, 12, 2, 4),
't4' = c(11, 13, 2, 4),
't5' = c(11, 16, 2, 4))
x <- melt.data.table(x, id.vars = 'id')
find_max_replace_subsequent <- function(numbers){
max_is <- max(numbers)
max_found <- 0
for(i in c(1:length(numbers))){
current_number <- numbers[i]
if(max_found == 1){
numbers[i] <- NA
}
if(current_number == max_is & max_found == 0){
max_found <- 1
}
}
return(numbers)
}
x[, 'value' := find_max_replace_subsequent(numbers = value), by = c(by1 = 'id')]
x <- dcast.data.table(x, formula = id~variable, value.var = 'value')
x
Upvotes: 1
Reputation: 21908
Thank you @Martin Gal for suggesting such a clever hint. Yes we could indeed dispense with the else
statement:
library(dplyr)
library(purrr)
df %>%
pmap_df(~ {x <- c(...)[-1]
ind <- which.max(x)
if(ind < length(x)) {
x[(ind + 1):length(x)] <- NA
}
c(c(...)[1], x)
})
# A tibble: 4 x 6
id t1 t2 t3 t4 t5
<int> <int> <int> <int> <int> <int>
1 1 5 10 11 NA NA
2 2 6 7 12 13 16
3 3 1 2 NA NA NA
4 4 3 3 4 NA NA
Upvotes: 2
Reputation: 160417
dat[cbind(FALSE, t(apply(dat[,-1], 1, function(z) duplicated(z) & z >= max(z))))] <- NA
dat
# id t1 t2 t3 t4 t5
# 1 1 5 10 11 NA NA
# 2 2 6 7 12 13 16
# 3 3 1 2 NA NA NA
# 4 4 3 3 4 NA NA
Breakdown:
Since we need to work row-wise, we'll use apply(dat, 1, .)
.
On each row, we need those that are at-or-above the max value and duplicated, ergo the anon-func
function(z) duplicated(z) & z >= max(z)
This by itself produces a transposed matrix (because of how R's apply
operates), which we then t
ranspose into a correctly-shaped logical matrix:
t(apply(dat[,-1], 1, function(z) duplicated(z) & z >= max(z)))
# t1 t2 t3 t4 t5
# [1,] FALSE FALSE FALSE TRUE TRUE
# [2,] FALSE FALSE FALSE FALSE FALSE
# [3,] FALSE FALSE TRUE TRUE TRUE
# [4,] FALSE FALSE FALSE TRUE TRUE
We omitted the id
column with dat[,-1]
, but for reassigning NA
, we need to cbind(FALSE, .)
so that the id
column is preserved.
Lastly, we reassign to just those fields by using dat[.] <- NA
.
PS: the alternate functions used in other answers works just as well here:
# equivalent with this sample data
function(z) duplicated(z) & z >= max(z)
function(z) seq_along(z) > which.max(z)
The largest differences in the answers (so far) is a preference towards R dialects, whether base or dplyr+purrr
.
Data
dat <- structure(list(id = 1:4, t1 = c(5L, 6L, 1L, 3L), t2 = c(10L, 7L, 2L, 3L), t3 = c(11L, 12L, NA, 4L), t4 = c(NA, 13L, NA, NA), t5 = c(NA, 16L, NA, NA)), row.names = c(NA, -4L), class = "data.frame")
Upvotes: 3
Reputation: 39858
One option using dplyr
and purrr
could be:
df %>%
mutate(pmap_dfr(across(-id), ~ `[<-`(c(...), seq_along(c(...)) > which.max(c(...)), NA)))
id t1 t2 t3 t4 t5
1 1 5 10 11 NA NA
2 2 6 7 12 13 16
3 3 1 2 NA NA NA
4 4 3 3 4 NA NA
Upvotes: 2