jmogil

Reputation: 153

Conditional rolling sum across columns

I have a data frame of values across successive years (columns) for unique individuals (rows). A dummy data example is provided here:

dt = structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), `2015` = c(0, 
0.8219178, 0, 0.1369863, 0, 1.369863, 0.2739726, 0.8219178, 5, 
0), `2016` = c(0, 1.369863, 0, 0.2739726, 0, 0.2739726, 0, 3.2876712, 
0, 0), `2017` = c(0.6849315, 0, 0, 0.6849315, 0, 0.5479452, 0, 
0, 0, 0), `2018` = c(1.0958904, 0.5479452, 1.9178082, 0, 0, 0, 
0, 0, 0, 3), `2019` = c(0, 0, 0, 1.0958904, 0, 0.9589041, 0.5479452, 
0, 0, 0), `2020` = c(0.4383562, 0, 0, 0, 0.2739726, 0.6849315, 
0, 0, 0, 0)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-10L))

I want to create a dataset where the value for each individual in each year is capped at a maximum of 1. Where a value exceeds this cap, I want to carry the excess over 1 into the next year (column), add it to that year's value for the individual, and so on across the remaining years.
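To make the rule concrete, here is a small standalone sketch (my own illustration, not code from the question) tracing the cap-and-carry logic on a single row, using ID 9's values:

```r
# Illustration of the carry-over rule on one row (ID 9: 5, 0, 0, 0, 0, 0)
x <- c(5, 0, 0, 0, 0, 0)
out <- numeric(length(x))
carry <- 0
for (i in seq_along(x)) {
  total <- x[i] + carry       # this year's value plus last year's excess
  out[i] <- min(total, 1)     # cap at 1
  carry <- max(total - 1, 0)  # the excess rolls into the next year
}
out
#> [1] 1 1 1 1 1 0
```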

The expected result is:

dt_expected = structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), `2015` = c(0, 
0.8219178, 0, 0.1369863, 0, 1, 0.2739726, 0.8219178, 1, 0), `2016` = c(0, 
1, 0, 0.2739726, 0, 0.6438356, 0, 1, 1, 0), `2017` = c(0.6849315, 
0.369863, 0, 0.6849315, 0, 0.5479452, 0, 1, 1, 0), `2018` = c(1, 
0.5479452, 1, 0, 0, 0, 0, 1, 1, 1), `2019` = c(0.0958904, 0, 
0.9178082, 1, 0, 0.9589041, 0.5479452, 0.2876712, 1, 1), `2020` = c(0.4383562, 
0, 0, 0.0958904, 0.2739726, 0.6849315, 0, 0, 0, 1)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -10L))

I am at a total loss as to where to start with this problem, so any assistance achieving this using data.table would be greatly appreciated. My only thought is to use lapply with an ifelse for the conditional component. But should I then be using rowSums or Reduce to shift the excess values across columns?

Upvotes: 5

Views: 180

Answers (4)

Stefano Barbi

Reputation: 3194

Here is a tidyverse solution: reshape to long format, use purrr::accumulate() to thread the running excess (the carry) through each ID's years, cap each year's total at 1, then reshape back to wide:

library(dplyr)
library(tidyr)
library(purrr)

dt |>
  pivot_longer(cols = -ID, names_to = "year") |>
  arrange(ID, year) |>
  group_by(ID) |>
  mutate(x = {
    # r holds the carry entering each year: the accumulated excess over 1
    r <- accumulate(value,
                    ~ max(0, .y + .x - 1),
                    .init = 0)
    pmin(1, value + head(r, -1))
  }) |>
  select(x, year, ID) |>
  pivot_wider(names_from = "year", values_from = "x")

##> # A tibble: 10 × 7
##> # Groups:   ID [10]
##>       ID `2015` `2016` `2017` `2018` `2019` `2020`
##>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##>  1     1  0      0      0.685  1     0.0959 0.438 
##>  2     2  0.822  1      0.370  0.548 0      0     
##>  3     3  0      0      0      1     0.918  0     
##>  4     4  0.137  0.274  0.685  0     1      0.0959
##>  5     5  0      0      0      0     0      0.274 
##>  6     6  1      0.644  0.548  0     0.959  0.685 
##>  7     7  0.274  0      0      0     0.548  0     
##>  8     8  0.822  1      1      1     0.288  0     
##>  9     9  1      1      1      1     1      0     
##> 10    10  0      0      0      1     1      1     
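The key step is the accumulate() call, which passes the running excess forward; head(r, -1) then aligns each carry with the year it feeds into. Tracing it on ID 9's row (a standalone illustration, not part of the pipeline above):

```r
library(purrr)

value <- c(5, 0, 0, 0, 0, 0)                       # ID 9's yearly values
r <- accumulate(value, ~ max(0, .x + .y - 1), .init = 0)
r                                                  # carry entering each year
#> [1] 0 4 3 2 1 0 0
pmin(1, value + head(r, -1))                       # capped yearly values
#> [1] 1 1 1 1 1 0
```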


Upvotes: 1

Martin Morgan

Reputation: 46886

Not particularly pretty or efficient, but as a starting point I used pmin() and pmax() to update each year (and the subsequent year) iteratively. The current year becomes the minimum of the current year and 1 (pmin(x, 1)); the subsequent year becomes the current subsequent year plus the excess of the previous year (pmax(x - 1, 0)).

update <- function(df) {
    result = df
    for (idx in 2:(ncol(df) - 1)) {
        x = result[[ idx ]]
        result[[ idx ]] = pmin(x, 1)   # cap the current year at 1
        # roll the excess into the next year before it is processed
        result[[ idx + 1 ]] = result[[ idx + 1 ]] + pmax(x - 1, 0)
    }
    result
}

We have

> all.equal(update(dt), dt_expected)
[1] TRUE

I don't know how to translate this into efficient data.table syntax, but the function 'works' as is on a data.table, update(as.data.table(dt)).

Upvotes: 5

Jaap

Reputation: 83275

A translation of Martin Morgan's answer to data.table, using set() to update the columns by reference:

library(data.table)
setDT(dt)

for (i in 2:(ncol(dt) - 1)) {
  x = dt[[i]]
  set(dt, j = i, value = pmin(x, 1))                         # cap the current year at 1
  set(dt, j = i + 1, value = dt[[i + 1L]] + pmax(x - 1, 0))  # roll the excess forward
}

Upvotes: 6

shs

Reputation: 3899

Not sure if there is a more efficient way with built-in functions, but I simply wrote a recursive function that implements your described algorithm for a single row, and then applied it over every row.

f <- function(l, rest = 0, out = list()) {
  if (length(l) == 0) return(unlist(out))
  if (l[[1]] + rest <= 1) {
    # value plus carry fits under the cap: emit it and reset the carry
    f(l[-1], rest = 0, out = append(out, list(l[[1]] + rest)))
  } else {
    # over the cap: emit 1 and carry the excess forward
    f(l[-1], rest = l[[1]] + rest - 1, out = append(out, list(1)))
  }
}

dt[-1] <- apply(dt[-1], 1, f, simplify = FALSE) |>
  do.call(what = rbind)

dt
#> # A tibble: 10 × 7
#>       ID `2015` `2016` `2017` `2018` `2019` `2020`
#>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#>  1     1  0      0      0.685  1     0.0959 0.438 
#>  2     2  0.822  1      0.370  0.548 0      0     
#>  3     3  0      0      0      1     0.918  0     
#>  4     4  0.137  0.274  0.685  0     1      0.0959
#>  5     5  0      0      0      0     0      0.274 
#>  6     6  1      0.644  0.548  0     0.959  0.685 
#>  7     7  0.274  0      0      0     0.548  0     
#>  8     8  0.822  1      1      1     0.288  0     
#>  9     9  1      1      1      1     1      0     
#> 10    10  0      0      0      1     1      1

Created on 2022-03-25 by the reprex package (v2.0.1)

Upvotes: 2
