Gopala
Gopala

Reputation: 10483

Need to get R cummax but dealing properly with NAs

I have a data frame like this:

dput(df1)
structure(list(x = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), y = c(16449L, NA, NA, 
16449L, 16450L, 16451L, NA, NA, 16455L, 16456L, NA, NA, 16756L, 
NA, 16460L, 16464L, 16469L, NA, NA, 16469L)), .Names = c("x", 
"y"), row.names = c(NA, -20L), class = "data.frame")

I need to mutate the y column as follows (using dplyr):

df1 <- mutate(df1, y = ifelse(is.na(y), cummax(y), y))

However, cummax does not handle NAs right for my case. How can I get the same effect with some alternative method?

Resulting output should have the NA rows of y filled with the last previous non-NA value of y. They are arranged in order.

Alternatively, I tried something like this, and it does not work:

mutate(df1, y = ifelse(is.na(y), max(y[1:row_number()], na.rm = TRUE), y)

Because row_number() itself is a vector of 1 through current row, it creates error.

EDIT: Desired output is as follows:

structure(list(x = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), y = c(16449, 16449, 
16449, 16449, 16450, 16451, 16451, 16451, 16455, 16456, 16456, 
16456, 16756, 16756, 16460, 16464, 16469, 16756, 16756, 16469
)), class = "data.frame", .Names = c("x", "y"), row.names = c(NA, 
-20L))

Upvotes: 1

Views: 1425

Answers (3)

Gregor Thomas
Gregor Thomas

Reputation: 145775

If the input starts with NA, then the cummax should be NA until the first non-NA value. Here's a solution that takes that into account:

cummax_na = function(x, na.rm = TRUE) {
  if(!na.rm) return(cummax(x))
  if(all(is.na(x))) return(x)
  # check for leading NAs to keep
  first_non_na = match(TRUE, !is.na(x))
  x = dplyr::coalesce(x, -Inf)
  result = cummax(x)
  if(first_non_na > 1) result[1:(first_non_na - 1)] = NA
  result
}
cummax_na(c(1, 4, 3, 2, 1))
# [1] 1 4 4 4 4
cummax_na(c(1, 4, 3, NA, 1, 5))
# [1] 1 4 4 4 4 5
cummax_na(c(NA, 1, 4, 3, NA, 1, 5))
# [1] NA  1  4  4  4  4  5
cummax_na(c(NA, NA, 1, 4, 3, NA, 1, 5))
# [1] NA NA  1  4  4  4  4  5

Upvotes: 3

David Heckmann
David Heckmann

Reputation: 2939

Reduce() would be another option:

df1 %>% mutate( y.cummax =  ifelse( is.na(y) , Reduce( function(x1, x2) { max(x1,x2 ,na.rm =TRUE) } , y , accumulate = TRUE   ) , y ) )

    x     y
1   1 16449
2   2 16449
3   3 16449
4   4 16449
5   5 16450
6   6 16451
7   7 16451
8   8 16451
9   9 16455
10 10 16456
11  1 16456
12  2 16456
13  3 16756
14  4 16756
15  5 16460
16  6 16464
17  7 16469
18  8 16756
19  9 16756
20 10 16469

Upvotes: 2

Colonel Beauvel
Colonel Beauvel

Reputation: 31171

You can do:

library(dplyr)

v = cummax(ifelse(is.na(df1$y), -Inf, df1$y))  #A. Webb suggested -Inf instead of 0, great!

mutate(df1, y=ifelse(is.na(y), v, y))

#    x     y
#1   1 16449
#2   2 16449
#3   3 16449
#4   4 16449
#5   5 16450
#6   6 16451
#7   7 16451
#8   8 16451
#9   9 16455
#10 10 16456
#11  1 16456
#12  2 16456
#13  3 16756
#14  4 16756
#15  5 16460
#16  6 16464
#17  7 16469
#18  8 16756
#19  9 16756
#20 10 16469

Or you can use data.table:

setDT(transform(df1,ix=1:nrow(df1)))[,max(df1$y[1:ix],na.rm=T),by=.(ix)]

Upvotes: 6

Related Questions