Reputation: 25
I have a dataset with 4 columns which looks like that:
City | Year | Week | Average |
---|---|---|---|
Guelph | 2020 | 2020-04-12 | 28.3 |
Hamilton | 2020 | 2020-04-12 | 10.7 |
Waterloo | 2020 | 2020-04-12 | 50.1 |
Guelph | 2020 | 2020-04-20 | 3.5 |
Hamilton | 2020 | 2020-04-20 | 42.9 |
I would like to sum the average column for the same week. In other words, I want to create a new dataset with three columns (Year, week, Average) where I won't have 3 different rows for the same week but only one (e.g instead of having three times 20220-04-12, I will have it one) and the corresponding cell in the average column will be the sum of all the rows that correspond to the same week. Something like that:
Year | Week | Average |
---|---|---|
2020 | 2020-04-12 | 89.1 |
2020 | 2020-04-20 | 46.4 |
where 89.1 is the sum of the first three rows that are at the same week and 46.4 is the sum of the last two rows of the initial table that correspond to the same week (2020-04-20).
The code I am using for that looks like that:
data_set <- data_set %>%
select(`Year`, `Week`, `Average`) %>%
group_by(Year, Week) %>%
summarize(Average = sum(Average))
but for some weeks I am getting back NAs and for some other I get the correct sum I want. The data are all numeric and in the initial dataset there are some NA values on the Average column.
Thanks in advance
Upvotes: 0
Views: 81
Reputation: 2849
You can accomplish this by passing in na.rm = TRUE
to sum
. Also, since you group_by(Year, Week)
, there isn't much to gain with using select
in this case since you are generating a summary statistic on the Average
variable within summarise
.
df <- structure(list(City = c("Guelph", "Hamilton", "Waterloo", "Guelph",
"Hamilton"), Year = c(2020L, 2020L, 2020L, 2020L, 2020L), Week = c("2020-04-12",
"2020-04-12", "2020-04-12", "2020-04-20", "2020-04-20"), Average = c(28.3,
10.7, 50.1, 3.5, 42.9)), class = "data.frame", row.names = c(NA,
-5L))
library(dplyr)
df %>%
mutate(
Week = as.Date(Week),
) %>%
group_by(Year, Week) %>%
summarise(
Average = sum(Average, na.rm = TRUE)
)
#> # A tibble: 2 x 3
#> # Groups: Year [1]
#> Year Week Average
#> <int> <date> <dbl>
#> 1 2020 2020-04-12 89.1
#> 2 2020 2020-04-20 46.4
Created on 2021-03-10 by the reprex package (v0.3.0)
Upvotes: 1