Virginia Morera Pujol

Reputation: 477

purrr: summarising one dataframe and using the result to add columns to another row-wise

The problem: I have two datasets that are essentially time sequences with something measured at each time point. df1 has a sample point every 30 minutes, while df2 has a sample point every 5 minutes. I want to join the two datasets, but to do that I need to summarise df2 so it also has only one value per 30-minute interval. However, these intervals are not always constant, so I've come up with the solution below, which I then need to apply to every row of df1. I think there are purrr functions that will let me map a custom function row-wise, but I haven't quite figured out how.

This is what I've done so far:

0 - Packages

library(lubridate)
library(tidyverse)
library(wakefield)
library(magrittr)

1 - Generate dummy datasets for the example

df1 <- data.frame(date_time = format(seq(as.POSIXct("2021-01-01 00:00:00", tz = "GMT"), 
                                         as.POSIXct("2021-01-01 02:30:00", tz = "GMT"), 
                                         by = "30 min"), "%d-%m-%Y %H:%M:%S"), 
                  other_vars = upper(6, k = 5, x = LETTERS, prob = NULL, name = "Upper"))


df2 <- data.frame(date_time = format(seq(as.POSIXct("2021-01-01 00:00:00", tz = "GMT"), 
                                         as.POSIXct("2021-01-01 02:30:00", tz = "GMT"), 
                                         by = "5 min"), "%d-%m-%Y %H:%M:%S"),
                  num_var1 = runif(31, 0, 255), 
                  num_var2 = runif(31, 0, 255))

2 - Add the time at row i+1 as a new column in df1

df1 %<>% 
  mutate(date_time_lead = lead(date_time))

3 - Subset rows of df2 with time points between time i and i+1 (in this case i == 1) of df1 and summarise the values (mean, sd, sum)

df2_sub <- df2 %>% 
  filter(date_time >= df1$date_time[1] & date_time < df1$date_time_lead[1]) %>% 
  summarise(var1_mean = mean(num_var1, na.rm = TRUE), 
            var1_sd   = sd(num_var1, na.rm = TRUE), 
            var1_sum  = sum(num_var1, na.rm = TRUE),
            var2_mean = mean(num_var2, na.rm = TRUE), 
            var2_sd   = sd(num_var2, na.rm = TRUE), 
            var2_sum  = sum(num_var2, na.rm = TRUE))

Now, ideally I'd put all of this in a custom function and then use one of the purrr functions to apply it to each row of df1 (see the sketch after step 5), but I haven't managed to figure out how, so this is the next step.

4 - Generate new columns in df1

df1$var1_mean <- NA
df1$var1_sd <- NA
df1$var1_sum <- NA
df1$var2_mean <- NA
df1$var2_sd <- NA
df1$var2_sum <- NA

5 - Add the newly generated summarised values to the ith row (in this case the first) of df1

df1[1, 4:9] <- df2_sub[1,]
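
For reference, here is one way steps 3-5 could be wrapped into a custom function and mapped over the rows of df1 with purrr. This is only a sketch: summarise_interval() is a name I've made up, and the is.na() guard (which keeps every later time point for the last row of df1, whose date_time_lead is NA) is my assumption about the intended behaviour.

summarise_interval <- function(from, to) {
  # subset df2 to the half-open interval [from, to) and summarise both variables
  df2 %>%
    filter(date_time >= from, is.na(to) | date_time < to) %>%
    summarise(var1_mean = mean(num_var1, na.rm = TRUE),
              var1_sd   = sd(num_var1, na.rm = TRUE),
              var1_sum  = sum(num_var1, na.rm = TRUE),
              var2_mean = mean(num_var2, na.rm = TRUE),
              var2_sd   = sd(num_var2, na.rm = TRUE),
              var2_sum  = sum(num_var2, na.rm = TRUE))
}

# map the function over the paired (date_time, date_time_lead) columns and
# unnest the resulting one-row data frames into new columns of df1
df1_summarised <- df1 %>%
  mutate(summaries = map2(date_time, date_time_lead, summarise_interval)) %>%
  unnest(summaries)

Note that date_time is a character column here, so >= and < compare alphabetically; that happens to work for this single-day dummy data, but with the "%d-%m-%Y" format it will misbehave across days, so parsing to POSIXct first is safer.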

Upvotes: 0

Views: 77

Answers (2)

Virginia Morera Pujol

Reputation: 477

OK, after installing the development version of dplyr so I could use the join_by() function, I still get the following error:

> df2 %>%
  left_join(df1, join_by(closest(date_time >= date_time))) %>%
  group_by(date_time = date_time.y) %>%
  summarise(across(starts_with("num_"), list(mean = mean, sd = sd, sum = sum)))

Error in `dplyr::common_by()`:
! `by` must be a (named) character vector, list, or NULL for natural joins (not recommended in
production code), not a <dplyr_join_by> object.
Run `rlang::last_error()` to see where the error occurred.

Running rlang::last_error() gives

> rlang::last_error()
<error/rlang_error>
Error in `dplyr::common_by()`:
! `by` must be a (named) character vector, list, or NULL for natural joins (not recommended in
production code), not a <dplyr_join_by> object.
---
Backtrace:
 1. ... %>% ...
 4. tidylog::left_join(., df1, join_by(closest(date_time >= date_time)))
 5. tidylog:::log_join(...)
 9. dplyr:::common_by.default(by = by, x = x, y = y)
Run `rlang::last_trace()` to see the full context.

Now, rlang::last_trace() gives

> rlang::last_trace()
<error/rlang_error>
Error in `dplyr::common_by()`:
! `by` must be a (named) character vector, list, or NULL for natural joins (not recommended in
  production code), not a <dplyr_join_by> object.
---
Backtrace:
     ▆
  1. ├─... %>% ...
  2. ├─dplyr::summarise(...)
  3. ├─dplyr::group_by(., date_time = date_time.y)
  4. └─tidylog::left_join(., df1, join_by(closest(date_time >= date_time)))
  5.   └─tidylog:::log_join(...)
  6.     ├─base::suppressMessages(dplyr::common_by(by = by, x = x, y = y))
  7.     │ └─base::withCallingHandlers(...)
  8.     ├─dplyr::common_by(by = by, x = x, y = y)
  9.     └─dplyr:::common_by.default(by = by, x = x, y = y)
 10.       └─rlang::abort(msg)

Since you managed to run it with the exact same dataset, I assume this is a version issue. I updated dplyr and rlang to the latest versions, but I might have left something behind.

> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.utf8  LC_CTYPE=English_United Kingdom.utf8   
[3] LC_MONETARY=English_United Kingdom.utf8 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] wakefield_0.3.6         raster_3.5-15           sp_1.4-7                magrittr_2.0.3         
 [5] suncalc_0.5.1           rnaturalearthdata_0.1.0 rnaturalearth_0.1.0     kableExtra_1.3.4       
 [9] amt_0.1.7               cowplot_1.1.1           tidylog_1.0.2           lubridate_1.8.0        
[13] sf_1.0-7                forcats_0.5.1           stringr_1.4.0           dplyr_1.1.0.9000       
[17] purrr_0.3.4             readr_2.1.2             tidyr_1.2.0             tibble_3.1.7           
[21] ggplot2_3.3.6           tidyverse_1.3.1        

loaded via a namespace (and not attached):
 [1] fs_1.5.2           usethis_2.1.6      devtools_2.4.3     webshot_0.5.3      httr_1.4.3        
 [6] rprojroot_2.0.3    tools_4.2.1        backports_1.4.1    utf8_1.2.2         R6_2.5.1          
[11] KernSmooth_2.23-20 DBI_1.1.2          colorspace_2.0-3   withr_2.5.0        tidyselect_1.2.0  
[16] prettyunits_1.1.1  processx_3.5.3     curl_4.3.2         compiler_4.2.1     cli_3.3.0         
[21] rvest_1.0.2        xml2_1.3.3         desc_1.4.1         scales_1.2.0       classInt_0.4-3    
[26] callr_3.7.0        proxy_0.4-26       systemfonts_1.0.4  digest_0.6.29      svglite_2.1.0     
[31] rmarkdown_2.14     htmltools_0.5.2    pkgconfig_2.0.3    sessioninfo_1.2.2  dbplyr_2.1.1      
[36] fastmap_1.1.0      rlang_1.0.6.9000   readxl_1.4.0       rstudioapi_0.13    generics_0.1.3    
[41] jsonlite_1.8.0     Matrix_1.4-1       Rcpp_1.0.8.3       munsell_0.5.0      fansi_1.0.3       
[46] lifecycle_1.0.3    terra_1.5-21       stringi_1.7.6      brio_1.1.3         pkgbuild_1.3.1    
[51] grid_4.2.1         crayon_1.5.1       lattice_0.20-45    haven_2.5.0        splines_4.2.1     
[56] hms_1.1.1          knitr_1.39         ps_1.7.0           pillar_1.8.1       codetools_0.2-18  
[61] clisymbols_1.2.0   pkgload_1.2.4      reprex_2.0.1       glue_1.6.2         evaluate_0.15     
[66] data.table_1.14.2  remotes_2.4.2      modelr_0.1.8       vctrs_0.5.2.9000   tzdb_0.3.0        
[71] Rdpack_2.3.1       testthat_3.1.4     cellranger_1.1.0   gtable_0.3.0       assertthat_0.2.1  
[76] cachem_1.0.6       xfun_0.31          rbibutils_2.2.8    broom_0.8.0        e1071_1.7-9       
[81] viridisLite_0.4.0  class_7.3-20       survival_3.3-1     memoise_2.0.1      units_0.8-0       
[86] ellipsis_0.3.2 
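
For what it's worth, the backtrace above shows the call being routed through tidylog::left_join() rather than dplyr's own verb (tidylog_1.0.2 is attached, per the sessionInfo), and tidylog's wrapper hands by to dplyr::common_by(), which does not accept join_by() objects. Assuming that diagnosis is right, a sketch of a workaround is to namespace-qualify the verb so tidylog's wrapper is bypassed:

# call dplyr::left_join() explicitly so that tidylog's logging wrapper,
# which funnels `by` through dplyr::common_by(), never sees the join_by() object
df2 %>%
  dplyr::left_join(df1, join_by(closest(date_time >= date_time))) %>%
  group_by(date_time = date_time.y) %>%
  summarise(across(starts_with("num_"), list(mean = mean, sd = sd, sum = sum)))

Detaching tidylog should have the same effect.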

Upvotes: 0

Wimpel

Reputation: 27802

Are you looking for something like this?

It is hard to tell whether this is correct, since your sample data are not minimal, and I do not follow the logical operators in the line
date_time > df1$date_time[1] & date_time < df1$date_time_lead[1]
which suggest that in the interval 0:00-0:30 you want to exclude the values at both 0:00 and 0:30?

library(data.table)
# set to data.table format
setDT(df1)
setDT(df2)
# perform rolling join, to roll down to the nearest date_from 
df2[, c("date_time_df1", "other_vars") := df1[df2, .(x.date_time, x.other_vars), on = .(date_time), roll = Inf]]
# melt to long
df2.melt <- melt(df2, measure.vars = patterns("^num_var"))
# summarise
df2.melt[, .(mean = mean(value, na.rm = TRUE),
             sd = sd(value, na.rm = TRUE),
             sum = sum(value, na.rm = TRUE)), 
         by = .(date_time = date_time_df1, other_vars, variable)]
#                  date_time other_vars variable      mean       sd       sum
#     1: 01-01-2021 00:00:00          B num_var1 183.51225 57.36997 1101.0735
#     2: 01-01-2021 00:30:00          B num_var1  96.46239 69.12586  578.7744
#     3: 01-01-2021 01:00:00          D num_var1 163.74048 84.57741  982.4429
#     4: 01-01-2021 01:30:00          A num_var1 120.74330 41.14782  724.4598
#     5: 01-01-2021 02:00:00          E num_var1 109.74861 85.56809  658.4917
# ---                                                                     
# 34942: 30-12-2021 22:00:00          E num_var2 140.67153 79.74211  844.0292
# 34943: 30-12-2021 22:30:00          B num_var2 101.25896 75.61385  607.5538
# 34944: 30-12-2021 23:00:00          E num_var2  77.86649 50.45125  467.1989
# 34945: 30-12-2021 23:30:00          D num_var2 152.96687 66.00130  917.8012
# 34946: 31-12-2021 00:00:00          E num_var2 245.89731       NA  245.8973

tidyverse solution

using dplyr >= 1.1.0, which introduced join_by() and closest()

df2 %>%
  left_join(df1, join_by(closest(date_time >= date_time))) %>%
  group_by(date_time = date_time.y) %>%
  summarise(across(starts_with("num_"), list(mean = mean, sd = sd, sum = sum)))
# A tibble: 6 × 7
#   date_time           num_var1_mean num_var1_sd num_var1_sum num_var2_mean num_var2_sd num_var2_sum
#   <chr>                       <dbl>       <dbl>        <dbl>         <dbl>       <dbl>        <dbl>
# 1 01-01-2021 00:00:00         149.         88.6         896.         116.         91.5         697.
# 2 01-01-2021 00:30:00         122.         60.8         731.          71.8        71.8         431.
# 3 01-01-2021 01:00:00         114.         86.5         681.         165.         75.9         988.
# 4 01-01-2021 01:30:00         187.         61.9        1120.         156.         81.2         936.
# 5 01-01-2021 02:00:00          87.2        86.2         523.         130.         66.9         781.
# 6 01-01-2021 02:30:00         125.         NA           125.         251.         NA           251.
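
One caveat, which is my addition rather than part of the answer above: date_time is built as a character vector in "%d-%m-%Y %H:%M:%S" format, which sorts alphabetically rather than chronologically, so the inequality join is only reliable here because the dummy data sit within a single day. Parsing to POSIXct first, e.g. with lubridate (already loaded in the question), makes the comparisons safe:

# parse the day-first character timestamps to POSIXct so that >= compares
# chronologically rather than alphabetically
df1 <- df1 %>% mutate(date_time = dmy_hms(date_time, tz = "GMT"))
df2 <- df2 %>% mutate(date_time = dmy_hms(date_time, tz = "GMT"))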

Upvotes: 1
