Reputation: 431
I have a dataframe that looks like this:
It contains the daily data for var1, var2, var3 for city A,B,C from 2020-01-01 to 2020-12-31.
+-----------+------------+------+------+------+
| Geography | Dates | var1 | var2 | var3 |
+-----------+------------+------+------+------+
| A | 2020-01-01 | 10 | 100 | 1 |
+-----------+------------+------+------+------+
| A | 2020-01-02 | 20 | 200 | 2 |
+-----------+------------+------+------+------+
| A | 2020-01-03 | 30 | 300 | 3 |
+-----------+------------+------+------+------+
| A | ... | ... | ... | ... |
+-----------+------------+------+------+------+
| A | 2020-12-31 | 50 | 500 | 5 |
+-----------+------------+------+------+------+
| B | 2020-01-01 | 60 | 600 | 6 |
+-----------+------------+------+------+------+
| B | 2020-01-02 | 70 | 700 | 7 |
+-----------+------------+------+------+------+
| B | 2020-01-03 | 50 | 500 | 5 |
+-----------+------------+------+------+------+
| B | ... | ... | ... | ... |
+-----------+------------+------+------+------+
| B | 2020-12-31 | 60 | 600 | 6 |
+-----------+------------+------+------+------+
| C | 2020-01-01 | 30 | 300 | 3 |
+-----------+------------+------+------+------+
| C | 2020-01-02 | 40 | 400 | 4 |
+-----------+------------+------+------+------+
| C | 2020-01-03 | 50 | 500 | 5 |
+-----------+------------+------+------+------+
| C | ... | ... | ... | ... |
+-----------+------------+------+------+------+
| C | 2020-12-31 | 70 | 700 | 7 |
+-----------+------------+------+------+------+
I would like to divide this data into 3 time segments:
And calculate the average for each variable, for each of those time segments, for each city, respectively.
The ending dataframe would like this:
+-----------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+
| Geography | var1_seg1_avg | var1_seg2_avg | var1_seg3_avg | var2_seg1_avg | var2_seg2_avg | var2_seg3_avg | var3_seg1_avg | var3_seg2_avg | var3_seg3_avg |
+-----------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+
| A | xx | xx | xx | xx | xx | xx | xx | xx | xx |
+-----------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+
| B | xx | xx | xx | xx | xx | xx | xx | xx | xx |
+-----------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+
| C | xx | xx | xx | xx | xx | xx | xx | xx | xx |
+-----------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+
What is an elegant way to do this using tidyverse / dplyr?
Much appreciation for your help!
Upvotes: 0
Views: 117
Reputation: 39613
You can also try:
library(dplyr)
library(tidyr)
#Code
new <- df %>%
#Create group
mutate(Dates=as.Date(Dates),
Group=ifelse(Dates>=as.Date('2020-01-01') & Dates<=as.Date('2020-05-30'),'Seg1',
ifelse(Dates>=as.Date('2020-06-01') & Dates<=as.Date('2020-08-31'),'Seg2',
ifelse(Dates>=as.Date('2020-09-01') & Dates<=as.Date('2020-12-31'),'Seg3',NA)))) %>%
group_by(Geography,Group) %>%
select(-Dates) %>%
summarise_all(mean,na.rm=T) %>%
pivot_wider(names_from = Group,values_from=c(var1:var3))
Output:
# A tibble: 3 x 7
# Groups: Geography [3]
Geography var1_Seg1 var1_Seg3 var2_Seg1 var2_Seg3 var3_Seg1 var3_Seg3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 20 50 200 500 2 5
2 B 60 60 600 600 6 6
3 C 40 70 400 700 4 7
Althought, @akrun solution is more efficient in terms of creating the groups and avoid conditionals.
Some data used:
#Data
df <- structure(list(Geography = c("A", "A", "A", "A", "B", "B", "B",
"B", "C", "C", "C", "C"), Dates = c(" 2020-01-01 ", " 2020-01-02 ",
" 2020-01-03 ", " 2020-12-31 ", " 2020-01-01 ", " 2020-01-02 ",
" 2020-01-03 ", " 2020-12-31 ", " 2020-01-01 ", " 2020-01-02 ",
" 2020-01-03 ", " 2020-12-31 "), var1 = c(10L, 20L, 30L, 50L,
60L, 70L, 50L, 60L, 30L, 40L, 50L, 70L), var2 = c(100L, 200L,
300L, 500L, 600L, 700L, 500L, 600L, 300L, 400L, 500L, 700L),
var3 = c(1L, 2L, 3L, 5L, 6L, 7L, 5L, 6L, 3L, 4L, 5L, 7L)), row.names = c(2L,
4L, 6L, 9L, 11L, 13L, 15L, 18L, 20L, 22L, 24L, 27L), class = "data.frame")
Upvotes: 3
Reputation: 887891
An option would be to cut
the 'Dates' to create a new column, do a group by mean
and pivot to wide format
library(dplyr)
library(tidyr)
df1 %>%
group_by(Geography,
seg = paste0("seg", as.integer(cut(as.Date(Dates),
breaks = c("2020-01-01", "2020-06-01", "2020-09-01", "2020-12-31")))) %>%
summarise(across(starts_with('var'), mean, na.rm = TRUE), .groups = 'drop') %>%
pivot_wider(names_from = seg, values_from = starts_with('var'))
Upvotes: 2