Reputation: 5907
Given a table, I am trying to learn how to use R for summing variables based on when a certain condition is met (based on other variables in the same table).
Using the dplyr library, (I created some data) and then summed the data by groups :
#load library
library(dplyr)
#create data
data <- data.frame(
"col_a" = c("aaa", "aaa", "aaa", "aaa", "aaa", "aaa", "aaa", "aaa"),
"col_b" = c("123", "124", "125", "126", "127", "128", "129", "130"),
"col_c" = c("2015", "2015", "2015", "2015", "2015", "2015", "2015", "2015"),
"col_d" = c("red", "red", "red", "blue", "blue", "green", "green", "green"),
"day_a" = c("2001-01-01", "2000-01-05", "2000-01-01", "2010-12-20", "2010-12-20", "2020-05-05", "2020-05-05", "2020-05-28"),
"day_b" = c("2001-01-10", "2000-01-10", "2000-01-01", "2010-12-25", "2010-12-22", "2020-05-15", "2020-05-20", "2020-05-30")
)
#format variable types
data$col_a = as.factor(data$col_a)
data$col_b = as.factor(data$col_b)
data$col_c = as.factor(data$col_c)
#format date variables
data$day_a = as.factor(data$day_a)
data$day_b = as.factor(data$day_b)
data$day_1 = as.Date(as.character(data$day_a))
data$day_2 = as.Date(as.character(data$day_b))
#create new variable based on difference between date variables
data$diff = data$day_2 - data$day_1
data$diff = as.numeric(data$diff)
#create file that sums days based on groups of "col_a, col_c, col_d"
file = data%>%
group_by(col_a, col_c, col_d) %>%
dplyr::summarize(Total = sum(diff, na.rm=TRUE), Count = n())
file = as.data.frame(file)
Now, for groups of "col_a, col_c, col_d", I want to sum the "diff" variable based on another condition.
For example, for the group "aaa, 2015, green", I want to only sum the "unique days" - i.e. the days that overlap. (2020-05-05, 2020-05-15), ( 2020-05-05, 2020-05-20), (2020-05-28 ,2020-05-30)
For this group, I want the value of the "total" variable = 15 + 2 = 17 ... instead of "27".
This is because the dates(2020-05-05, 2020-05-15) is completely within the dates ( 2020-05-05, 2020-05-20). I only want to sum the "unique" date periods.
I am trying to get something in the end that looks like this:
final_result <- data.frame ( col_a = c("aaa", "aaa", "aaa"),
col_c = c("2015", "2015", "2015"),
col_d = c("blue", "green", "red"),
total = c("5","17","9"),
count = c("2", "3", "3")
)
Can anyone please show me how to do this?
Thanks
Upvotes: 3
Views: 263
Reputation: 24790
Here's an approach with purrr::map2
:
First, convert the Date
columns to integer representations. Then use map2
to create vectors of the integer sequences between the two dates. It seems you don't want to count the last day, so I subtracted 1 from day 2
.
Now we have a new column dates
which holds a vector of dates as integers.
library(purrr)
data %>%
transmute(dates = map2(as.integer(day_1),as.integer(day_2)-1,seq))
1 11323, 11324, 11325, 11326, 11327, 11328, 11329, 11330, 11331
2 10961, 10962, 10963, 10964, 10965
3 10957, 10956
4 14963, 14964, 14965, 14966, 14967
5 14963, 14964
6 18387, 18388, 18389, 18390, 18391, 18392, 18393, 18394, 18395, 18396
7 18387, 18388, 18389, 18390, 18391, 18392, 18393, 18394, 18395, 18396, 18397, 18398, 18399, 18400, 18401
8 18410, 18411
Then we can group as you did previously, and summarize by unlisting the dates for a specific group and using unique
to remove duplicates. Then just count up the number of dates.
data %>%
mutate(dates = map2(as.integer(day_1),as.integer(day_2)-1,seq)) %>%
group_by(col_a, col_c, col_d) %>%
dplyr::summarize(Total = length(unique(unlist(dates))), Count = n())
# A tibble: 3 x 5
# Groups: col_a, col_c [1]
col_a col_c col_d Total Count
<fct> <fct> <chr> <int> <int>
1 aaa 2015 blue 5 2
2 aaa 2015 green 17 3
3 aaa 2015 red 16 3
Upvotes: 2