cpanagakos
cpanagakos

Reputation: 35

Continual summation of a column in R until condition is met

I am doing my best to learn R, and this is my first post on this forum.

I currently have a data frame with a populated vector "x" and an unpopulated vector "counter" as follows:

x <- c(NA,1,0,0,0,0,1,1,1,1,0,1)

df <- data.frame("x" = x, "counter" = 0)

    x counter
1  NA       0
2   1       0
3   0       0
4   0       0
5   0       0
6   0       0
7   1       0
8   1       0
9   1       0
10  1       0
11  0       0
12  1       0

I am having a surprisingly difficult time trying to write code that will simply populate counter so that counter sums the cumulative, sequential 1s in x, but reverts back to zero when x is zero. Accordingly, I would like counter to calculate as follows per the above example:

    x counter
1  NA       NA
2   1       1
3   0       0
4   0       0
5   0       0
6   0       0
7   1       1
8   1       2
9   1       3
10  1       4
11  0       0
12  1       1

I have tried using lag() and ifelse(), both with and without for loops, but seem to be getting further and further away from a workable solution (while lag got me close, the figures were not calculating as expected....my ifelse and for loops eventually ended up with length 1 vectors of NA_real_, NA or 1). I have also considered cumsum - but not sure how to frame the range to just the 1s - and have searched and reviewed similar posts, for example How to add value to previous row if condition is met; however, I still cannot figure out what I would expect to be a very simple task.

Admittedly, I am at a low point in my early R learning curve and greatly appreciate any help and constructive feedback anyone from the community can provide. Thank you.

Upvotes: 1

Views: 753

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 388982

You can use :

library(dplyr)

df %>%
  group_by(x1 = cumsum(replace(x, is.na(x), 0) == 0)) %>%
  mutate(counter = (row_number() - 1) * x) %>%
  ungroup %>%
  select(-x1)

#       x counter
#   <dbl>   <dbl>
# 1    NA      NA
# 2     1       1
# 3     0       0
# 4     0       0
# 5     0       0
# 6     0       0
# 7     1       1
# 8     1       2
# 9     1       3
#10     1       4
#11     0       0
#12     1       1

Explaining the steps -

  • Create a new column (x1), replace NA in x with 0 and increment the group value by 1 (using cumsum) whenever x = 0.
  • For each group subtract the row number with 0 and multiply it by x. This multiplication is necessary because it will help to keep counter as 0 where x = 0 and counter as NA where x is NA.

Upvotes: 1

crestor
crestor

Reputation: 1466

Welcome @cpanagakos.

In dplyr::lag it's not posibble to use a column that still doesn't exist. (It can't refer to itself.) https://www.reddit.com/r/rstats/comments/a34n6b/dplyr_use_previous_row_from_a_column_thats_being/

For example:

library(tidyverse)


df <- tibble("x" = c(NA, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1))

# error: lag cannot refer to a column that still doesn't exist
df %>%
  mutate(counter = case_when(is.na(x) ~ coalesce(lag(counter), 0),
                             x == 0   ~ 0,
                             x == 1   ~ lag(counter) + 1))
#> Error: Problem with `mutate()` input `counter`.
#> x object 'counter' not found
#> i Input `counter` is `case_when(...)`.

So, if you have a criteria that "resets" the counter, you would need to write a formula that changes the group when you need a reset an then refer to the row_number, that will be restarted at 1 inside the group (like @Ronald Shah and others suggest):

Create sequential counter that restarts on a condition within panel data groups

df %>%
  group_by(x1 = cumsum(!coalesce(x, 0))) %>%
  mutate(counter = row_number() - 1) %>%
  ungroup()
#> # A tibble: 12 x 3
#>        x    x1 counter
#>    <dbl> <int>   <dbl>
#>  1    NA     1      NA
#>  2     1     1       1
#>  3     0     2       0
#>  4     0     3       0
#>  5     0     4       0
#>  6     0     5       0
#>  7     1     5       1
#>  8     1     5       2
#>  9     1     5       3
#> 10     1     5       4
#> 11     0     6       0
#> 12     1     6       1

This would be one of the few cases where using a for loop in R could be justified: because the alternatives are conceptually harder to understand.

Upvotes: 1

Related Questions