stats_noob
stats_noob

Reputation: 5907

R: Sum variables based on some condition

Given a table, I am trying to learn how to use R for summing variables based on when a certain condition is met (based on other variables in the same table).

Using the dplyr library, (I created some data) and then summed the data by groups :

#load library
library(dplyr)

#create data
data <- data.frame(
    
    "col_a" = c("aaa", "aaa", "aaa", "aaa", "aaa", "aaa", "aaa", "aaa"),
    "col_b" = c("123", "124", "125", "126", "127", "128", "129", "130"),
"col_c" = c("2015", "2015", "2015", "2015", "2015", "2015", "2015", "2015"),
"col_d" = c("red", "red", "red", "blue", "blue", "green", "green", "green"),
"day_a" = c("2001-01-01", "2000-01-05", "2000-01-01", "2010-12-20", "2010-12-20", "2020-05-05", "2020-05-05", "2020-05-28"),
"day_b" = c("2001-01-10", "2000-01-10", "2000-01-01", "2010-12-25", "2010-12-22", "2020-05-15", "2020-05-20", "2020-05-30")

)

#format variable types

data$col_a = as.factor(data$col_a)
data$col_b = as.factor(data$col_b)
data$col_c = as.factor(data$col_c)

#format date variables
data$day_a = as.factor(data$day_a)
data$day_b = as.factor(data$day_b)

data$day_1 = as.Date(as.character(data$day_a))
data$day_2 = as.Date(as.character(data$day_b))

#create new variable based on difference between date variables
data$diff = data$day_2 - data$day_1
data$diff = as.numeric(data$diff)

#create file that sums days based on groups of "col_a, col_c, col_d"
file = data%>%
    group_by(col_a, col_c, col_d) %>% 
    dplyr::summarize(Total = sum(diff, na.rm=TRUE), Count = n()) 

file = as.data.frame(file)

Now, for groups of "col_a, col_c, col_d", I want to sum the "diff" variable based on another condition.

For example, for the group "aaa, 2015, green", I want to only sum the "unique days" - i.e. the days that overlap. (2020-05-05, 2020-05-15), ( 2020-05-05, 2020-05-20), (2020-05-28 ,2020-05-30)

For this group, I want the value of the "total" variable = 15 + 2 = 17 ... instead of "27".

This is because the dates(2020-05-05, 2020-05-15) is completely within the dates ( 2020-05-05, 2020-05-20). I only want to sum the "unique" date periods.

I am trying to get something in the end that looks like this:

final_result <- data.frame ( col_a = c("aaa", "aaa", "aaa"),
                             col_c = c("2015", "2015", "2015"), 
                             col_d = c("blue", "green", "red"),
                             total = c("5","17","9"),
count = c("2", "3", "3")
                                 
    )

Can anyone please show me how to do this?

Thanks

Upvotes: 3

Views: 263

Answers (1)

Ian Campbell
Ian Campbell

Reputation: 24790

Here's an approach with purrr::map2:

First, convert the Date columns to integer representations. Then use map2 to create vectors of the integer sequences between the two dates. It seems you don't want to count the last day, so I subtracted 1 from day 2.

Now we have a new column dates which holds a vector of dates as integers.

library(purrr)
data %>% 
  transmute(dates = map2(as.integer(day_1),as.integer(day_2)-1,seq))
1                                           11323, 11324, 11325, 11326, 11327, 11328, 11329, 11330, 11331
2                                                                       10961, 10962, 10963, 10964, 10965
3                                                                                            10957, 10956
4                                                                       14963, 14964, 14965, 14966, 14967
5                                                                                            14963, 14964
6                                    18387, 18388, 18389, 18390, 18391, 18392, 18393, 18394, 18395, 18396
7 18387, 18388, 18389, 18390, 18391, 18392, 18393, 18394, 18395, 18396, 18397, 18398, 18399, 18400, 18401
8                                                                                            18410, 18411

Then we can group as you did previously, and summarize by unlisting the dates for a specific group and using unique to remove duplicates. Then just count up the number of dates.

data %>% 
  mutate(dates = map2(as.integer(day_1),as.integer(day_2)-1,seq)) %>%
  group_by(col_a, col_c, col_d) %>%
  dplyr::summarize(Total = length(unique(unlist(dates))), Count = n())
# A tibble: 3 x 5
# Groups:   col_a, col_c [1]
  col_a col_c col_d Total Count
  <fct> <fct> <chr> <int> <int>
1 aaa   2015  blue      5     2
2 aaa   2015  green    17     3
3 aaa   2015  red      16     3

Upvotes: 2

Related Questions