Reputation: 303
I have a data set of different IDs, Date, Quantities. All IDs have the same exact start date, however some have starting quantities of 0 so basically the start date should be later.
I am trying to filter the rows such that each ID
df <- data.frame(ID = c("1", "1", "1", "1", "1", "1",
"2", "2", "2", "2", "2", "2",
"3", "3", "3", "3", "3", "3"),
Date = c(seq(as.Date("2000/1/1"), by = "month", length.out = 6),
seq(as.Date("2000/1/1"), by = "month", length.out = 6),
seq(as.Date("2000/1/1"), by = "month", length.out = 6)),
QTY = c(0, 0, 0, 40, 0, 60, 50, 55, 0, 70, 88, 64, 0, 43, 43, 0, 99, 99))
So basically in the df example above, I want ID number 1 to start from the fourth QTY value, and for ID number 3 to start from the second QTY value.
I tried using group by and to mutate a new column with an ifelse but I had some error.
My expected df should be a transformation from the first one to the one below, so just to remove the first rows with zeroes for each ID until QTY starts to be more than 0
df <- data.frame(ID = c("1", "1", "1",
"2", "2", "2", "2", "2", "2",
"3", "3", "3", "3", "3"),
Date = c(seq(as.Date("2000/4/1"), by = "month", length.out = 3),
seq(as.Date("2000/1/1"), by = "month", length.out = 6),
seq(as.Date("2000/2/1"), by = "month", length.out = 5)),
QTY = c(40, 0, 60, 50, 55, 0, 70, 88, 64, 43, 43, 0, 99, 99))
Upvotes: 3
Views: 1005
Reputation: 14764
If the data is as in your example (i.e. no NA
or negative values), just this base
solution would suffice:
df[with(df, ave(QTY, ID, FUN = cumsum)) > 0, ]
Or subset(df, ave(QTY, ID, FUN = cumsum) > 0)
, but likely a tiny bit slower.
Output:
ID Date QTY
4 1 2000-04-01 40
5 1 2000-05-01 0
6 1 2000-06-01 60
7 2 2000-01-01 50
8 2 2000-02-01 55
9 2 2000-03-01 0
10 2 2000-04-01 70
11 2 2000-05-01 88
12 2 2000-06-01 64
14 3 2000-02-01 43
15 3 2000-03-01 43
16 3 2000-04-01 0
17 3 2000-05-01 99
18 3 2000-06-01 99
This is the same as in dplyr
:
library(dplyr)
df %>%
group_by(ID) %>%
filter(cumsum(QTY) > 0)
Upvotes: 3
Reputation: 124183
Try this. Basic idea is compute the cumsum
of non-zero QTY
for each ID
over time. Then filter out obs with cumsum == 0
:
library(dplyr)
df <- data.frame(ID = c("1", "1", "1", "1", "1", "1",
"2", "2", "2", "2", "2", "2",
"3", "3", "3", "3", "3", "3"),
Date = c(seq(as.Date("2000/1/1"), by = "month", length.out = 6),
seq(as.Date("2000/1/1"), by = "month", length.out = 6),
seq(as.Date("2000/1/1"), by = "month", length.out = 6)),
QTY = c(0, 0, 0, 40, 0, 60, 50, 55, 0, 70, 88, 64, 0, 43, 43, 0, 99, 99))
df %>%
arrange(ID, Date) %>%
group_by(ID) %>%
mutate(cum_non_zero = cumsum(QTY > 0)) %>%
filter(cum_non_zero > 0)
#> # A tibble: 14 x 4
#> # Groups: ID [3]
#> ID Date QTY cum_non_zero
#> <fct> <date> <dbl> <int>
#> 1 1 2000-04-01 40 1
#> 2 1 2000-05-01 0 1
#> 3 1 2000-06-01 60 2
#> 4 2 2000-01-01 50 1
#> 5 2 2000-02-01 55 2
#> 6 2 2000-03-01 0 2
#> 7 2 2000-04-01 70 3
#> 8 2 2000-05-01 88 4
#> 9 2 2000-06-01 64 5
#> 10 3 2000-02-01 43 1
#> 11 3 2000-03-01 43 2
#> 12 3 2000-04-01 0 2
#> 13 3 2000-05-01 99 3
#> 14 3 2000-06-01 99 4
Created on 2020-03-31 by the reprex package (v0.3.0)
Upvotes: 2