principe
principe

Reputation: 303

Remove rows after a group by based on a column value

I have a data set of different IDs, Date, Quantities. All IDs have the same exact start date, however some have starting quantities of 0 so basically the start date should be later.

I am trying to filter the rows such that each ID

df <- data.frame(ID = c("1", "1", "1", "1", "1", "1",
                        "2", "2", "2", "2", "2", "2",
                        "3", "3", "3", "3", "3", "3"),
                 Date = c(seq(as.Date("2000/1/1"), by = "month", length.out = 6),
                          seq(as.Date("2000/1/1"), by = "month", length.out = 6),
                          seq(as.Date("2000/1/1"), by = "month", length.out = 6)),
                 QTY = c(0, 0, 0, 40, 0, 60, 50, 55, 0, 70, 88, 64, 0, 43, 43, 0, 99, 99))

So basically in the df example above, I want ID number 1 to start from the fourth QTY value, and for ID number 3 to start from the second QTY value.

I tried using group by and to mutate a new column with an ifelse but I had some error.

My expected df should be a transformation from the first one to the one below, so just to remove the first rows with zeroes for each ID until QTY starts to be more than 0

df <- data.frame(ID = c("1", "1", "1",
                        "2", "2", "2", "2", "2", "2",
                        "3", "3", "3", "3", "3"),
                 Date = c(seq(as.Date("2000/4/1"), by = "month", length.out = 3),
                          seq(as.Date("2000/1/1"), by = "month", length.out = 6),
                          seq(as.Date("2000/2/1"), by = "month", length.out = 5)),
                 QTY = c(40, 0, 60, 50, 55, 0, 70, 88, 64, 43, 43, 0, 99, 99))

Upvotes: 3

Views: 1005

Answers (2)

arg0naut91
arg0naut91

Reputation: 14764

If the data is as in your example (i.e. no NA or negative values), just this base solution would suffice:

df[with(df, ave(QTY, ID, FUN = cumsum)) > 0, ]

Or subset(df, ave(QTY, ID, FUN = cumsum) > 0), but likely a tiny bit slower.

Output:

   ID       Date QTY
4   1 2000-04-01  40
5   1 2000-05-01   0
6   1 2000-06-01  60
7   2 2000-01-01  50
8   2 2000-02-01  55
9   2 2000-03-01   0
10  2 2000-04-01  70
11  2 2000-05-01  88
12  2 2000-06-01  64
14  3 2000-02-01  43
15  3 2000-03-01  43
16  3 2000-04-01   0
17  3 2000-05-01  99
18  3 2000-06-01  99

This is the same as in dplyr:

library(dplyr)

df %>%
  group_by(ID) %>%
  filter(cumsum(QTY) > 0)

Upvotes: 3

stefan
stefan

Reputation: 124183

Try this. Basic idea is compute the cumsum of non-zero QTY for each ID over time. Then filter out obs with cumsum == 0:

library(dplyr)

df <- data.frame(ID = c("1", "1", "1", "1", "1", "1",
                        "2", "2", "2", "2", "2", "2",
                        "3", "3", "3", "3", "3", "3"),
                 Date = c(seq(as.Date("2000/1/1"), by = "month", length.out = 6),
                          seq(as.Date("2000/1/1"), by = "month", length.out = 6),
                          seq(as.Date("2000/1/1"), by = "month", length.out = 6)),
                 QTY = c(0, 0, 0, 40, 0, 60, 50, 55, 0, 70, 88, 64, 0, 43, 43, 0, 99, 99))

df %>% 
  arrange(ID, Date) %>% 
  group_by(ID) %>% 
  mutate(cum_non_zero = cumsum(QTY > 0)) %>% 
  filter(cum_non_zero > 0)
#> # A tibble: 14 x 4
#> # Groups:   ID [3]
#>    ID    Date         QTY cum_non_zero
#>    <fct> <date>     <dbl>        <int>
#>  1 1     2000-04-01    40            1
#>  2 1     2000-05-01     0            1
#>  3 1     2000-06-01    60            2
#>  4 2     2000-01-01    50            1
#>  5 2     2000-02-01    55            2
#>  6 2     2000-03-01     0            2
#>  7 2     2000-04-01    70            3
#>  8 2     2000-05-01    88            4
#>  9 2     2000-06-01    64            5
#> 10 3     2000-02-01    43            1
#> 11 3     2000-03-01    43            2
#> 12 3     2000-04-01     0            2
#> 13 3     2000-05-01    99            3
#> 14 3     2000-06-01    99            4

Created on 2020-03-31 by the reprex package (v0.3.0)

Upvotes: 2

Related Questions