jakes
jakes

Reputation: 2085

Count days and variable values between two column dates

I've got a data frame like below:

data <- structure(list(type = c("LOCAL", "LOCAL", "LOCAL", "LOCAL", "LOCAL", 
"LOCAL"), start_date = structure(c(17416, 17255, 17331, 17331, 
17269, 17402), class = "Date"), end_date = structure(c(17422, 
17261, 17338, 17338, 17275, 17408), class = "Date"), var1 = c(9e-04, 
9e-04, 7e-04, 6e-04, 7e-04, 9e-04)), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -6L))

For each day of 2017, I need to calculate how many observations had it between start_date and end_date and what is the sum of var1 for these observations (that had the day between start_date and end_date). Therefore, I would like to obtain a data frame with all days in 2017, a column that counts for how many rows particular date is between start_date and end_date in original data frame, and finally a column with sum of var1 for particular day. Therefore, the output should look like below:

structure(list(date = structure(c(17167, 17168, 17169, 17170, 
17171, 17172, 17173, 17174, 17175, 17176, 17177, 17178, 17179, 
17180, 17181, 17182, 17183, 17184, 17185, 17186, 17187, 17188, 
17189, 17190, 17191, 17192, 17193, 17194, 17195, 17196, 17197, 
17198, 17199, 17200, 17201, 17202, 17203, 17204, 17205, 17206, 
17207, 17208, 17209, 17210, 17211, 17212, 17213, 17214, 17215, 
17216, 17217, 17218, 17219, 17220, 17221, 17222, 17223, 17224, 
17225, 17226, 17227, 17228, 17229, 17230, 17231, 17232, 17233, 
17234, 17235, 17236, 17237, 17238, 17239, 17240, 17241, 17242, 
17243, 17244, 17245, 17246, 17247, 17248, 17249, 17250, 17251, 
17252, 17253, 17254, 17255, 17256, 17257, 17258, 17259, 17260, 
17261, 17262, 17263, 17264, 17265, 17266, 17267, 17268, 17269, 
17270, 17271, 17272, 17273, 17274, 17275, 17276, 17277, 17278, 
17279, 17280, 17281, 17282, 17283, 17284, 17285, 17286, 17287, 
17288, 17289, 17290, 17291, 17292, 17293, 17294, 17295, 17296, 
17297, 17298, 17299, 17300, 17301, 17302, 17303, 17304, 17305, 
17306, 17307, 17308, 17309, 17310, 17311, 17312, 17313, 17314, 
17315, 17316, 17317, 17318, 17319, 17320, 17321, 17322, 17323, 
17324, 17325, 17326, 17327, 17328, 17329, 17330, 17331, 17332, 
17333, 17334, 17335, 17336, 17337, 17338, 17339, 17340, 17341, 
17342, 17343, 17344, 17345, 17346, 17347, 17348, 17349, 17350, 
17351, 17352, 17353, 17354, 17355, 17356, 17357, 17358, 17359, 
17360, 17361, 17362, 17363, 17364, 17365, 17366, 17367, 17368, 
17369, 17370, 17371, 17372, 17373, 17374, 17375, 17376, 17377, 
17378, 17379, 17380, 17381, 17382, 17383, 17384, 17385, 17386, 
17387, 17388, 17389, 17390, 17391, 17392, 17393, 17394, 17395, 
17396, 17397, 17398, 17399, 17400, 17401, 17402, 17403, 17404, 
17405, 17406, 17407, 17408, 17409, 17410, 17411, 17412, 17413, 
17414, 17415, 17416, 17417, 17418, 17419, 17420, 17421, 17422, 
17423, 17424, 17425, 17426, 17427, 17428, 17429, 17430, 17431, 
17432, 17433, 17434, 17435, 17436, 17437, 17438, 17439, 17440, 
17441, 17442, 17443, 17444, 17445, 17446, 17447, 17448, 17449, 
17450, 17451, 17452, 17453, 17454, 17455, 17456, 17457, 17458, 
17459, 17460, 17461, 17462, 17463, 17464, 17465, 17466, 17467, 
17468, 17469, 17470, 17471, 17472, 17473, 17474, 17475, 17476, 
17477, 17478, 17479, 17480, 17481, 17482, 17483, 17484, 17485, 
17486, 17487, 17488, 17489, 17490, 17491, 17492, 17493, 17494, 
17495, 17496, 17497, 17498, 17499, 17500, 17501, 17502, 17503, 
17504, 17505, 17506, 17507, 17508, 17509, 17510, 17511, 17512, 
17513, 17514, 17515, 17516, 17517, 17518, 17519, 17520, 17521, 
17522, 17523, 17524, 17525, 17526, 17527, 17528, 17529, 17530, 
17531), class = "Date"), count = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 
1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 
1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 
    sum = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9e-04, 9e-04, 9e-04, 
    9e-04, 9e-04, 9e-04, 9e-04, 0, 0, 0, 0, 0, 0, 0, 7e-04, 7e-04, 
    7e-04, 7e-04, 7e-04, 7e-04, 7e-04, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0013, 0.0013, 0.0013, 0.0013, 
    0.0013, 0.0013, 0.0013, 0.0013, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9e-04, 9e-04, 
    9e-04, 9e-04, 9e-04, 9e-04, 9e-04, 0, 0, 0, 0, 0, 0, 0, 9e-04, 
    9e-04, 9e-04, 9e-04, 9e-04, 9e-04, 9e-04, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, 
-365L))

My approach was to create an additional counts data frame, concatenate it to original data frame and then for each date column check if the date is between start_date and end_date. Then gather(), and group_by() + summarise() should give us expected output. However, the following code

data %>% 
  bind_cols(counts) %>% 
  mutate_at(vars(matches('[0-9]+\\-[0-9]+\\-[0-9]+')), ~between(ymd(names(.)), start_date, end_date))

ends up in error:

Error: Expecting a single value: [extent=6].

I can't understand what's going wrong here.

Upvotes: 0

Views: 143

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 388807

We could create a dataframe for entire year (yearly_data)

yearly_data <- data.frame(date = seq(as.Date("2017-01-01"), 
                                     as.Date("2017-12-31"), by = "1 day"))

Expand data by creating a sequence between start_date and end_date and do a left_join with yearly_data, group_by date and count the number of rows for each date and sum var1.

library(dplyr)

yearly_data %>%
   left_join(data %>%
   tidyr::unnest(date = purrr::map2(start_date, end_date, seq, by = "1 day")) %>%
   select(-start_date, -end_date), by = "date") %>%
   group_by(date) %>%
   summarise(count = sum(var1 != 0, na.rm = TRUE), 
             var1 = sum(var1, na.rm = TRUE))

Upvotes: 1

kath
kath

Reputation: 7724

One approach with tidyr and dplyr functions is

my_df %>% 
  mutate(id = 1:n()) %>% 
  gather(start_end, date, start_date, end_date) %>% 
  group_by(id) %>% 
  complete(date = seq.Date(min(date), max(date), by = "1 day")) %>% 
  fill(var1) %>% 
  ungroup() %>% 
  complete(date = seq.Date(as.Date("2017-01-01"), as.Date("2017-12-31"), by = "1 day")) %>% 
  group_by(date) %>% 
  summarize(count = sum(!is.na(var1)), 
            sum = sum(var1, na.rm = TRUE))

Note that I use complete twice. First within each id I get all the day between start and end date and fill the missing values with the var1. Then I use complete again to gt all days of 2017. Then count is simply the number of non-missing var1 values for each day and sum the sum of them.

Upvotes: 2

Related Questions