Reputation: 29
I have this dataset:
structure(list(id = c(2004938L, 2107410L, 2119255L, 2129457L,
2141169L, 2172051L), date = structure(c(17725, 17732, 17733,
17734, 17734, 17736), class = "Date"), hour = c(20, 22, 18, 12,
21, 22), store_name = c("Www Cigarsindia In India S Largest And Trusted Online Cigar Store Since 1998",
"Www Cigarsindia In India S Largest And Trusted Online Cigar Store Since 1998",
"Www Cigarsindia In India S Largest And Trusted Online Cigar Store Since 1998",
"Www Cigarsindia In India S Largest And Trusted Online Cigar Store Since 1998",
"Www Cigarsindia In India S Largest And Trusted Online Cigar Store Since 1998",
"Www Cigarsindia In India S Largest And Trusted Online Cigar Store Since 1998"
), area = c("Indiranagar, EGL", "Indiranagar, EGL", "Indiranagar, EGL",
"Indiranagar, EGL", "Indiranagar, EGL", "Indiranagar, EGL"),
amount = c(900, 2400, 2700, 380, 150, 100)), row.names = c(6264L,
10841L, 11355L, 11892L, 12348L, 13570L), class = "data.frame")
Let's call this "e".
I would like to summarize it as follows:
f = e %>%
  dplyr::group_by(date, store_name, area) %>%
  dplyr::summarize(
    amount = sum(amount, na.rm = TRUE),
    amount_after_8 = sum(amount[hour >= 20], na.rm = TRUE)
  )
This gives the output "f" as:
structure(list(date = structure(c(17725, 17732, 17733, 17734,
17736), class = "Date"), store_name = c("Www Cigarsindia In India S Largest And Trusted Online Cigar Store Since 1998",
"Www Cigarsindia In India S Largest And Trusted Online Cigar Store Since 1998",
"Www Cigarsindia In India S Largest And Trusted Online Cigar Store Since 1998",
"Www Cigarsindia In India S Largest And Trusted Online Cigar Store Since 1998",
"Www Cigarsindia In India S Largest And Trusted Online Cigar Store Since 1998"
), area = c("Indiranagar, EGL", "Indiranagar, EGL", "Indiranagar, EGL",
"Indiranagar, EGL", "Indiranagar, EGL"), amount = c(900, 2400,
2700, 530, 100), amount_after_8 = c(900, 2400, 0, 0, 100)), row.names = c(NA,
-5L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), vars = c("date",
"store_name"), drop = TRUE)
This output is wrong: the 5th row of "e" has an amount of 150 and satisfies the condition hour >= 20 (its hour is 21), yet amount_after_8 shows 0 for that date in the output dataset "f".
What am I doing wrong here?
Upvotes: 3
Views: 225
Reputation: 4298
The following would work:
e %>%
  dplyr::group_by(date, store_name, area) %>%
  dplyr::summarize(
    # Compute the filtered sum first, while amount still refers to the original column
    amount_after_8 = sum(amount[hour >= 20], na.rm = TRUE),
    amount = sum(amount, na.rm = TRUE)
  )
The problem is that summarize evaluates its arguments sequentially, so amount already refers to the summarized per-group total (a single value), not the original column, by the time amount_after_8 is computed.
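As a rough illustration of that ordering effect, here is a minimal sketch on a made-up two-row data frame. The names toy, grp, total_first, distinct_name and total_amount are hypothetical and not from the question. It also shows one possible alternative to reordering: giving the total a name that differs from the source column, so the original amount is never masked.

```r
library(dplyr)

# Toy data with hypothetical values: one group, one sale before 20:00 and one after.
toy <- data.frame(
  grp = c("a", "a"),
  hour = c(12, 21),
  amount = c(380, 150)
)

# Total first: `amount` is replaced by the group total (length 1) before the
# second expression runs, so amount[hour >= 20] no longer sees the row values.
total_first <- toy %>%
  group_by(grp) %>%
  summarize(
    amount = sum(amount, na.rm = TRUE),
    amount_after_8 = sum(amount[hour >= 20], na.rm = TRUE)
  )

# Distinct output name: the original column stays visible, so the order of
# the two expressions no longer matters.
distinct_name <- toy %>%
  group_by(grp) %>%
  summarize(
    total_amount = sum(amount, na.rm = TRUE),
    amount_after_8 = sum(amount[hour >= 20], na.rm = TRUE)
  )

print(total_first)
print(distinct_name)
```

If the output column must keep the name amount, reordering as in the answer above is the simpler route; renaming is only worth it when more derived columns depend on the raw values.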
Upvotes: 2