Rich Scriven

Reputation: 99371

Calculate cumsum() while ignoring NA values

Consider the following named vector x.

( x <- setNames(c(1, 2, 0, NA, 4, NA, NA, 6), letters[1:8]) )
# a  b  c  d  e  f  g  h 
# 1  2  0 NA  4 NA NA  6 

I'd like to calculate the cumulative sum of x while ignoring the NA values. Many R functions have an argument na.rm which removes NA elements prior to calculations. cumsum() is not one of them, which makes this operation a bit tricky.

I can do it this way.

y <- setNames(numeric(length(x)), names(x))
z <- cumsum(na.omit(x))
y[names(y) %in% names(z)] <- z
y[!names(y) %in% names(z)] <- x[is.na(x)]
y
# a  b  c  d  e  f  g  h 
# 1  3  3 NA  7 NA NA 13 

But this seems excessive, and makes a lot of new assignments/copies. I'm sure there's a better way.

What better methods are there to return the cumulative sum while effectively ignoring NA values?

Upvotes: 64

Views: 43233

Answers (6)

jblood94

Reputation: 17001

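Benchmarking several options on a vector of 1e5 values with 10% NA shows that collapse::fcumsum is the fastest by far: its median time is roughly six times lower than that of the next-best option.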

library(dplyr)     # coalesce()
library(collapse)  # fcumsum()

x <- runif(1e5)
x[sample(1e5, 1e4)] <- NA

microbenchmark::microbenchmark(
  ifelse = cumsum(ifelse(is.na(x), 0, x)) + x*0,
  coalesce = cumsum(coalesce(x, 0)) + x*0,
  na.omit = "[<-"(x, !is.na(x), cumsum(na.omit(x))),
  is.na = local({b <- !is.na(x); "[<-"(x, b, cumsum(x[b]))}),
  fcumsum = fcumsum(x),
  check = "equal"
)
#> Unit: microseconds
#>      expr    min      lq     mean  median      uq    max neval
#>    ifelse 1808.4 2672.40 3290.323 2853.80 3178.25 8807.5   100
#>  coalesce 2575.8 3543.45 4427.820 3890.20 5344.55 8142.4   100
#>   na.omit 1314.6 2056.25 2547.983 2231.50 2467.40 6259.2   100
#>     is.na  910.5 1472.50 2020.346 1698.80 1955.75 5431.0   100
#>   fcumsum  137.2  255.35  282.999  267.15  313.75  513.4   100

Upvotes: 0

Quinten

Reputation: 41601

Another option is the fcumsum() function from the collapse package:

( x <- setNames(c(1, 2, 0, NA, 4, NA, NA, 6), letters[1:8]) )
#>  a  b  c  d  e  f  g  h 
#>  1  2  0 NA  4 NA NA  6
library(collapse)
fcumsum(x)
#>  a  b  c  d  e  f  g  h 
#>  1  3  3 NA  7 NA NA 13

Created on 2022-08-24 with reprex v2.0.2

Upvotes: 2

DJV

Reputation: 4873

It's an old question, but tidyr offers a new solution based on the idea of replacing NA with zero. Note that the NA positions then carry the running total forward instead of staying NA.

require(tidyr)

cumsum(replace_na(x, 0))

# a  b  c  d  e  f  g  h 
# 1  3  3  3  7  7  7 13 

Upvotes: 37

josliber

Reputation: 44340

You can do this in one line with:

cumsum(ifelse(is.na(x), 0, x)) + x*0
#  a  b  c  d  e  f  g  h 
#  1  3  3 NA  7 NA NA 13

Or, similarly:

library(dplyr)
cumsum(coalesce(x, 0)) + x*0
#  a  b  c  d  e  f  g  h 
#  1  3  3 NA  7 NA NA 13 
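
The `+ x*0` term is what puts the NA values back: `NA * 0` is `NA`, while any finite number times zero is 0, so adding it leaves the non-missing sums unchanged. A small base R check of that reasoning, using the vector from the question (the name `mask` is only illustrative, and the sketch assumes x holds no infinite values, since `Inf * 0` is `NaN`):

```r
x <- setNames(c(1, 2, 0, NA, 4, NA, NA, 6), letters[1:8])

# 0 where x is observed, NA where x is missing
mask <- x * 0

# Running total with NA counted as 0, then NA restored by adding the mask
res <- cumsum(ifelse(is.na(x), 0, x)) + mask
res
# a  b  c  d  e  f  g  h 
# 1  3  3 NA  7 NA NA 13 
```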

Upvotes: 56

Rich Scriven

Reputation: 99371

Here's a function I came up with from the answers to this question. Thought I'd share it, since it seems to work well so far. It calculates the cumulative FUNC of x while ignoring NA. FUNC can be any one of sum, prod, min, or max (passed unquoted), and x is a numeric vector.

cumSkipNA <- function(x, FUNC)
{
    d <- deparse(substitute(FUNC))
    funs <- c("max", "min", "prod", "sum")
    stopifnot(is.vector(x), is.numeric(x), d %in% funs)
    FUNC <- match.fun(paste0("cum", d))
    x[!is.na(x)] <- FUNC(x[!is.na(x)])
    x
}

set.seed(1)
x <- sample(15, 10, TRUE)
x[c(2,7,5)] <- NA
x
# [1]  4 NA  9 14 NA 14 NA 10 10  1
cumSkipNA(x, sum)
# [1]  4 NA 13 27 NA 41 NA 51 61 62
cumSkipNA(x, prod)
# [1]      4     NA     36    504     NA   7056     NA
# [8]  70560 705600 705600
cumSkipNA(x, min)
# [1]  4 NA  4  4 NA  4 NA  4  4  1
cumSkipNA(x, max)
# [1]  4 NA  9 14 NA 14 NA 14 14 14 

Definitely nothing new, but maybe useful to someone.
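
A possible variation, sketched here as an assumption rather than as part of the function above: pass the cumulative function itself (`cumsum`, `cumprod`, `cummin`, `cummax`) instead of deparsing the name of the summary function, and compute the NA mask only once. The name `cum_skip_na` and its `cumfun` argument are hypothetical.

```r
# Hypothetical helper: apply a cumulative function to the non-missing
# elements of x and leave the NA positions untouched.
cum_skip_na <- function(x, cumfun = cumsum) {
    stopifnot(is.numeric(x), is.function(cumfun))
    ok <- !is.na(x)           # TRUE where x is observed
    x[ok] <- cumfun(x[ok])    # overwrite only the observed positions
    x
}

x <- c(4, NA, 9, 14, NA, 14, NA, 10, 10, 1)
cum_skip_na(x)
# [1]  4 NA 13 27 NA 41 NA 51 61 62
cum_skip_na(x, cummax)
# [1]  4 NA  9 14 NA 14 NA 14 14 14
```

This trades the input check on the function name for flexibility: any function that returns a vector of the same length as its input will work, but nothing stops a caller from passing one that does not.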

Upvotes: 12

lebatsnok

Reputation: 6479

Do you want something like this?

x2 <- x
x2[!is.na(x)] <- cumsum(x2[!is.na(x)])

x2

[edit] Alternatively, as suggested in a comment above, you can change the NAs to 0s:

miss <- is.na(x)
x[miss] <- 0
cs <- cumsum(x)
cs[miss] <- NA
# cs is the requested cumsum
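
Note that the second snippet overwrites x with zeros in the NA positions. If the original vector is still needed, the same idea can be written without modifying it. A minimal base R sketch using `replace()`, shown on the vector from the question:

```r
x <- setNames(c(1, 2, 0, NA, 4, NA, NA, 6), letters[1:8])

miss <- is.na(x)

# replace() returns a modified copy, so x itself is left untouched:
# zero out the NAs, take the running total, then put the NAs back
cs <- replace(cumsum(replace(x, miss, 0)), miss, NA)
cs
# a  b  c  d  e  f  g  h 
# 1  3  3 NA  7 NA NA 13 
```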

Upvotes: 30
