Reputation: 2085
I've got a data frame like below:
data <- structure(list(type = c("LOCAL", "LOCAL", "LOCAL", "LOCAL", "LOCAL",
"LOCAL"), start_date = structure(c(17416, 17255, 17331, 17331,
17269, 17402), class = "Date"), end_date = structure(c(17422,
17261, 17338, 17338, 17275, 17408), class = "Date"), var1 = c(9e-04,
9e-04, 7e-04, 6e-04, 7e-04, 9e-04)), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -6L))
For each day of 2017, I need to calculate how many observations had it between start_date
and end_date
and what is the sum of var1
for these observations (that had the day between start_date
and end_date
). Therefore, I would like to obtain a data frame with all days in 2017, a column that counts for how many rows particular date is between start_date
and end_date
in original data frame, and finally a column with sum of var1
for particular day. Therefore, the output should look like below:
structure(list(date = structure(c(17167, 17168, 17169, 17170,
17171, 17172, 17173, 17174, 17175, 17176, 17177, 17178, 17179,
17180, 17181, 17182, 17183, 17184, 17185, 17186, 17187, 17188,
17189, 17190, 17191, 17192, 17193, 17194, 17195, 17196, 17197,
17198, 17199, 17200, 17201, 17202, 17203, 17204, 17205, 17206,
17207, 17208, 17209, 17210, 17211, 17212, 17213, 17214, 17215,
17216, 17217, 17218, 17219, 17220, 17221, 17222, 17223, 17224,
17225, 17226, 17227, 17228, 17229, 17230, 17231, 17232, 17233,
17234, 17235, 17236, 17237, 17238, 17239, 17240, 17241, 17242,
17243, 17244, 17245, 17246, 17247, 17248, 17249, 17250, 17251,
17252, 17253, 17254, 17255, 17256, 17257, 17258, 17259, 17260,
17261, 17262, 17263, 17264, 17265, 17266, 17267, 17268, 17269,
17270, 17271, 17272, 17273, 17274, 17275, 17276, 17277, 17278,
17279, 17280, 17281, 17282, 17283, 17284, 17285, 17286, 17287,
17288, 17289, 17290, 17291, 17292, 17293, 17294, 17295, 17296,
17297, 17298, 17299, 17300, 17301, 17302, 17303, 17304, 17305,
17306, 17307, 17308, 17309, 17310, 17311, 17312, 17313, 17314,
17315, 17316, 17317, 17318, 17319, 17320, 17321, 17322, 17323,
17324, 17325, 17326, 17327, 17328, 17329, 17330, 17331, 17332,
17333, 17334, 17335, 17336, 17337, 17338, 17339, 17340, 17341,
17342, 17343, 17344, 17345, 17346, 17347, 17348, 17349, 17350,
17351, 17352, 17353, 17354, 17355, 17356, 17357, 17358, 17359,
17360, 17361, 17362, 17363, 17364, 17365, 17366, 17367, 17368,
17369, 17370, 17371, 17372, 17373, 17374, 17375, 17376, 17377,
17378, 17379, 17380, 17381, 17382, 17383, 17384, 17385, 17386,
17387, 17388, 17389, 17390, 17391, 17392, 17393, 17394, 17395,
17396, 17397, 17398, 17399, 17400, 17401, 17402, 17403, 17404,
17405, 17406, 17407, 17408, 17409, 17410, 17411, 17412, 17413,
17414, 17415, 17416, 17417, 17418, 17419, 17420, 17421, 17422,
17423, 17424, 17425, 17426, 17427, 17428, 17429, 17430, 17431,
17432, 17433, 17434, 17435, 17436, 17437, 17438, 17439, 17440,
17441, 17442, 17443, 17444, 17445, 17446, 17447, 17448, 17449,
17450, 17451, 17452, 17453, 17454, 17455, 17456, 17457, 17458,
17459, 17460, 17461, 17462, 17463, 17464, 17465, 17466, 17467,
17468, 17469, 17470, 17471, 17472, 17473, 17474, 17475, 17476,
17477, 17478, 17479, 17480, 17481, 17482, 17483, 17484, 17485,
17486, 17487, 17488, 17489, 17490, 17491, 17492, 17493, 17494,
17495, 17496, 17497, 17498, 17499, 17500, 17501, 17502, 17503,
17504, 17505, 17506, 17507, 17508, 17509, 17510, 17511, 17512,
17513, 17514, 17515, 17516, 17517, 17518, 17519, 17520, 17521,
17522, 17523, 17524, 17525, 17526, 17527, 17528, 17529, 17530,
17531), class = "Date"), count = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
sum = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9e-04, 9e-04, 9e-04,
9e-04, 9e-04, 9e-04, 9e-04, 0, 0, 0, 0, 0, 0, 0, 7e-04, 7e-04,
7e-04, 7e-04, 7e-04, 7e-04, 7e-04, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0013, 0.0013, 0.0013, 0.0013,
0.0013, 0.0013, 0.0013, 0.0013, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9e-04, 9e-04,
9e-04, 9e-04, 9e-04, 9e-04, 9e-04, 0, 0, 0, 0, 0, 0, 0, 9e-04,
9e-04, 9e-04, 9e-04, 9e-04, 9e-04, 9e-04, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA,
-365L))
My approach was to create an additional counts
data frame, concatenate it to original data frame and then for each date column check if the date is between start_date
and end_date
. Then gather()
, and group_by() + summarise()
should give us expected output. However, the following code
data %>%
bind_cols(counts) %>%
mutate_at(vars(matches('[0-9]+\\-[0-9]+\\-[0-9]+')), ~between(ymd(names(.)), start_date, end_date))
ends up in error:
Error: Expecting a single value: [extent=6].
I can't understand what's going wrong here.
Upvotes: 0
Views: 143
Reputation: 388807
We could create a dataframe for entire year (yearly_data
)
yearly_data <- data.frame(date = seq(as.Date("2017-01-01"),
as.Date("2017-12-31"), by = "1 day"))
Expand data
by creating a sequence between start_date
and end_date
and do a left_join
with yearly_data
, group_by
date
and count the number of rows for each date
and sum
var1
.
library(dplyr)
yearly_data %>%
left_join(data %>%
tidyr::unnest(date = purrr::map2(start_date, end_date, seq, by = "1 day")) %>%
select(-start_date, -end_date), by = "date") %>%
group_by(date) %>%
summarise(count = sum(var1 != 0, na.rm = TRUE),
var1 = sum(var1, na.rm = TRUE))
Upvotes: 1
Reputation: 7724
One approach with tidyr
and dplyr
functions is
my_df %>%
mutate(id = 1:n()) %>%
gather(start_end, date, start_date, end_date) %>%
group_by(id) %>%
complete(date = seq.Date(min(date), max(date), by = "1 day")) %>%
fill(var1) %>%
ungroup() %>%
complete(date = seq.Date(as.Date("2017-01-01"), as.Date("2017-12-31"), by = "1 day")) %>%
group_by(date) %>%
summarize(count = sum(!is.na(var1)),
sum = sum(var1, na.rm = TRUE))
Note that I use complete twice. First within each id I get all the day between start and end date and fill the missing values with the var1
. Then I use complete
again to gt all days of 2017. Then count is simply the number of non-missing var1
values for each day and sum
the sum of them.
Upvotes: 2