Werrby

Reputation: 41

R convert hourly data to daily but keep the categorical data

I currently have hourly data that I've converted to daily averages as shown below.

n <- 24
daily <- aggregate(day[,-1],
                   by = list(gl(ceiling(nrow(day[,-1])/n), n)[1:nrow(day[,-1])]),
                   FUN = mean)

However, I have two columns of categorical data that I would like R to keep. Right now each of these columns repeats the same value 24 times per day (for example, 24 rows of Summer and 24 rows of Monday).

How do I keep a single value per day, so that observation 1 has Summer and Monday, observation 2 has Summer and Tuesday, and so on?

see output here

Thank you.

Upvotes: 0

Views: 383

Answers (1)

ludvigolsen

Reputation: 181

I will provide two solutions. In the first, we aggregate the numeric and categorical columns separately and then join the results. This is a good technique to know for more advanced cases. In the second, we do everything in one step.

Here is some random data. The n columns are numeric and c columns are categorical.

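library(dplyr)  # provides %>% and across() used below

set.seed(1)

# Create a dataset
# stringsAsFactors = TRUE makes c1 and c2 factors (the default before R 4.0.0),
# which matches the <fct> columns in the output below
data <- data.frame(
  n1 = runif(24 * 3),
  n2 = runif(24 * 3),
  c1 = rep(c("a", "b", "c"), each = 24),
  c2 = rep(c("summer", "winter", "fall"), each = 24),
  stringsAsFactors = TRUE
)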

Group into days

First, we greedily group the rows into days (first 24 rows == day 1, etc.). This is done with groupdata2's group() function and the greedy method.

# Greedily group rows into days
# Creates new column `day`
# Note that the output is grouped by the new `day` column
data <- data %>% 
  groupdata2::group(
    n = 24,
    method = "greedy",
    col_name = "day"
  )

> # A tibble: 72 x 5
> # Groups:   day [3]
>        n1    n2 c1    c2     day  
>     <dbl> <dbl> <fct> <fct>  <fct>
>  1 0.266  0.347 a     summer 1    
>  2 0.372  0.334 a     summer 1    
>  3 0.573  0.476 a     summer 1    
>  4 0.908  0.892 a     summer 1    
>  5 0.202  0.864 a     summer 1    
>  6 0.898  0.390 a     summer 1    
>  7 0.945  0.777 a     summer 1    
>  8 0.661  0.961 a     summer 1    
>  9 0.629  0.435 a     summer 1    
> 10 0.0618 0.713 a     summer 1    
> # … with 62 more rows
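
For intuition, the greedy grouping above amounts to putting each consecutive block of 24 rows into the same group. Below is a minimal base R sketch of that idea, assuming the rows are already in chronological order and every day has exactly 24 rows. The names `toy` and `rows_per_day` are made up for this illustration and are not part of groupdata2.

```r
# Hypothetical toy data: 3 days of hourly rows (72 rows in total)
rows_per_day <- 24
toy <- data.frame(value = seq_len(rows_per_day * 3))

# Integer-divide the zero-based row index by 24:
# rows 1-24 -> day 1, rows 25-48 -> day 2, and so on
toy$day <- factor((seq_len(nrow(toy)) - 1) %/% rows_per_day + 1)

# Count the rows assigned to each day
table(toy$day)
```

If the last day has fewer than 24 rows, this sketch simply gives it a smaller group of its own.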

Solution 1 - two steps and join

  1. I find the first row of each day to get the categorical information.
  2. I get the mean of the numeric columns grouped by day.
  3. I join the output of 1) and 2).

# Extract categorical information
# Again note that `data` is grouped by `day`
# otherwise you would run `dplyr::group_by(day)` before `filter()`
day_info <- data %>% 
  dplyr::filter(dplyr::row_number() == 1) %>% 
  dplyr::select(-n1, -n2)

# Calculate means of the numeric columns
# Again note that `data` is grouped by `day`
# otherwise you would run `dplyr::group_by(day)` before `summarize()`
day_aggregates <- data %>% 
  dplyr::select(-c1, -c2) %>% 
  dplyr::summarise(across(where(is.numeric), mean))

# Join the two data frames
final_data <- day_aggregates %>% 
  dplyr::left_join(day_info, by = "day")

final_data

> # A tibble: 3 x 5
>   day      n1    n2 c1    c2    
>   <fct> <dbl> <dbl> <fct> <fct> 
> 1 1     0.543 0.520 a     summer
> 2 2     0.507 0.504 b     winter
> 3 3     0.492 0.505 c     fall

Solution 2 - one step

For the numeric columns, we get the mean. For the factor columns, we get the first value with head() in an anonymous function.

# Calculate means of numeric data and get first value of factors
# Again note that `data` is grouped by `day`
# otherwise you would run `dplyr::group_by(day)` before `summarize()`
final_data <- data %>% 
  dplyr::summarise(across(where(is.numeric), mean),
                   across(where(is.factor), function(x) head(x, 1)))

final_data

> # A tibble: 3 x 5
>   day      n1    n2 c1    c2    
>   <fct> <dbl> <dbl> <fct> <fct> 
> 1 1     0.543 0.520 a     summer
> 2 2     0.507 0.504 b     winter
> 3 3     0.492 0.505 c     fall  
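
If you would rather stay in base R, as in the question, the same "mean for numeric columns, first value for everything else" rule can be sketched without dplyr or groupdata2. This is only an illustration under the same assumptions (rows in order, 24 rows per day). The names `hourly`, `day_id`, `summarise_column` and `daily_base` are hypothetical.

```r
# Hypothetical data mirroring the example: 2 numeric and 2 categorical columns
set.seed(1)
hourly <- data.frame(
  n1 = runif(24 * 3),
  n2 = runif(24 * 3),
  c1 = rep(c("a", "b", "c"), each = 24),
  c2 = rep(c("summer", "winter", "fall"), each = 24)
)

# Each block of 24 consecutive rows is one day
day_id <- (seq_len(nrow(hourly)) - 1) %/% 24 + 1

# Mean for numeric columns, first value for any other column type
summarise_column <- function(x) {
  if (is.numeric(x)) mean(x) else x[1]
}

# Split by day, reduce every column to one value, and stack the one-row results
daily_base <- do.call(rbind, lapply(split(hourly, day_id), function(d) {
  as.data.frame(lapply(d, summarise_column))
}))

daily_base
```

Taking the first value is only safe when the categorical columns are constant within a day, as they are in the question.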

Upvotes: 1
