Reputation: 41
I currently have hourly data that I've converted to daily as below.
n <- 24
daily <- aggregate(day[,-1],
by = list(gl(ceiling(nrow(day[,-1])/n), n)[1:nrow(day[,-1])]),
FUN = mean)
However, I have 2 columns of categorical data which I would like R to keep. Right now each day has 24 identical entries in these columns, e.g. 24 rows of Summer and 24 rows of Monday.
How do I keep a single value per day, i.e. Summer and Monday for observation 1, Summer and Tuesday for observation 2, and so on?
Thank you.
Upvotes: 0
Views: 383
Reputation: 181
I will provide two solutions. In the first one, you compute the categorical values and the numeric means in two separate steps and then join the results. This is a good technique to know for more advanced cases. In the second solution, we do it in one step.
Here is some random data. The n columns (n1, n2) are numeric and the c columns (c1, c2) are categorical.
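library(dplyr) # provides %>%, across() and where()
set.seed(1)
# Create a dataset
# stringsAsFactors = TRUE is needed on R >= 4.0 so that c1 and c2
# are factors, as in the printed output below
data <- data.frame(
  n1 = runif(24 * 3),
  n2 = runif(24 * 3),
  c1 = rep(c("a", "b", "c"), each = 24),
  c2 = rep(c("summer", "winter", "fall"), each = 24),
  stringsAsFactors = TRUE
)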
First, we greedily group the rows into days (first 24 rows == day 1, etc.). This is done with groupdata2's group() function and the greedy method.
# Greedily group rows into days
# Creates new column `day`
# Note that the output is grouped by the new `day` column
data <- data %>%
groupdata2::group(
n = 24,
method = "greedy",
col_name = "day"
)
> # A tibble: 72 x 5
> # Groups: day [3]
> n1 n2 c1 c2 day
> <dbl> <dbl> <fct> <fct> <fct>
> 1 0.266 0.347 a summer 1
> 2 0.372 0.334 a summer 1
> 3 0.573 0.476 a summer 1
> 4 0.908 0.892 a summer 1
> 5 0.202 0.864 a summer 1
> 6 0.898 0.390 a summer 1
> 7 0.945 0.777 a summer 1
> 8 0.661 0.961 a summer 1
> 9 0.629 0.435 a summer 1
> 10 0.0618 0.713 a summer 1
> # … with 62 more rows
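If you would rather not install groupdata2, the same kind of day index can be sketched in base R. This is only an approximation of what the greedy method does for a fixed group size of 24, not groupdata2's implementation, and the `hourly` data frame is a hypothetical stand-in for your own data.

```r
# Hypothetical toy data: 72 hourly rows, i.e. 3 full days
hourly <- data.frame(n1 = seq_len(72))

# Rows 1-24 get day 1, rows 25-48 get day 2, and so on.
# A trailing partial day would simply form a smaller last group.
hourly$day <- factor(ceiling(seq_len(nrow(hourly)) / 24))

# Count the rows assigned to each day
table(hourly$day)
```

You would then call `dplyr::group_by(day)` yourself before the steps that follow, since this column does not come with grouping attached.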
# Extract categorical information
# Again note that `data` is grouped by `day`
# otherwise you would run `dplyr::group_by(day)` before `filter()`
day_info <- data %>%
dplyr::filter(dplyr::row_number() == 1) %>%
dplyr::select(-n1, -n2)
# Calculate means of the numeric columns
# Again note that `data` is grouped by `day`
# otherwise you would run `dplyr::group_by(day)` before `summarize()`
day_aggregates <- data %>%
dplyr::select(-c1, -c2) %>%
dplyr::summarise(dplyr::across(where(is.numeric), mean))
# Join the two data frames
final_data <- day_aggregates %>%
dplyr::left_join(day_info, by = "day")
final_data
> # A tibble: 3 x 5
> day n1 n2 c1 c2
> <fct> <dbl> <dbl> <fct> <fct>
> 1 1 0.543 0.520 a summer
> 2 2 0.507 0.504 b winter
> 3 3 0.492 0.505 c fall
In the second solution, everything happens in a single summarise() call. For the numeric columns, we get the mean. For the factor columns, we get the first value with head() in an anonymous function.
# Calculate means of numeric data and get first value of factors
# Again note that `data` is grouped by `day`
# otherwise you would run `dplyr::group_by(day)` before `summarize()`
final_data <- data %>%
dplyr::summarise(
  dplyr::across(where(is.numeric), mean),
  dplyr::across(where(is.factor), function(x) head(x, 1))
)
final_data
> # A tibble: 3 x 5
> day n1 n2 c1 c2
> <fct> <dbl> <dbl> <fct> <fct>
> 1 1 0.543 0.520 a summer
> 2 2 0.507 0.504 b winter
> 3 3 0.492 0.505 c fall
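Since the question started from aggregate(), here is a minimal base R sketch of the same idea: average the numeric columns per day and take the categorical values from the first row of each day. It assumes the rows are already sorted by time and that every day's categorical values are constant; the names `hourly`, `is_num`, `means`, `firsts` and `daily` are illustrative only.

```r
set.seed(1)
# Same shape as the example data above (hypothetical stand-in for your data)
hourly <- data.frame(
  n1 = runif(24 * 3),
  n2 = runif(24 * 3),
  c1 = rep(c("a", "b", "c"), each = 24),
  c2 = rep(c("summer", "winter", "fall"), each = 24),
  stringsAsFactors = TRUE
)

# Day index: every block of 24 consecutive rows is one day
day <- ceiling(seq_len(nrow(hourly)) / 24)

# Logical vector marking which columns are numeric
is_num <- vapply(hourly, is.numeric, logical(1))

# Mean of the numeric columns per day
means <- aggregate(hourly[is_num], by = list(day = day), FUN = mean)

# The first row of each day supplies the categorical values
firsts <- hourly[!duplicated(day), !is_num, drop = FALSE]

# Both pieces are ordered by day, so they can be bound column-wise
daily <- cbind(means, firsts)
rownames(daily) <- NULL
daily
```

The column-wise bind relies on both pieces being in day order, which holds here because the day index never decreases; with unsorted data a merge on the day column would be the safer choice.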
Upvotes: 1