Virginia Morera Pujol

Reputation: 477

purrr: summarising one dataframe and using the result to add columns to another row-wise

The problem: I have two datasets that are essentially time sequences with something measured at each time point. df1 has a sample point every 30 minutes, while df2 has a sample point every 5 minutes. I want to join the two datasets, but to do that I need to summarise df2 so it also has only one value per 30-minute interval. However, these intervals are not always constant, so I've come up with the solution below, which I then need to apply to every row of df1. I think there are purrr functions that will let me map a custom function row-wise, but I haven't quite figured out how.

This is what I've done so far:

0 - Packages

library(lubridate)
library(tidyverse)
library(wakefield)
library(magrittr)

1 - Generate dummy datasets for the example

df1 <- data.frame(date_time = format(seq(as.POSIXct("2021-01-01 00:00:00", tz = "GMT"), 
                                         as.POSIXct("2021-01-01 02:30:00", tz = "GMT"), 
                                         by = "30 min"), "%d-%m-%Y %H:%M:%S"), 
                  other_vars = upper(6, k = 5, x = LETTERS, prob = NULL, name = "Upper"))


df2 <- data.frame(date_time = format(seq(as.POSIXct("2021-01-01 00:00:00", tz = "GMT"), 
                                         as.POSIXct("2021-01-01 02:30:00", tz = "GMT"), 
                                         by = "5 min"), "%d-%m-%Y %H:%M:%S"),
                  num_var1 = runif(31, 0, 255), 
                  num_var2 = runif(31, 0, 255))

2 - Add the time at row i+1 as a new column in df1

df1 %<>% 
  mutate(date_time_lead = lead(date_time))

3 - Subset rows of df2 with time points between time i and i+1 (in this case i == 1) of df1 and summarise the values (mean, sd, sum)

df2_sub <- df2 %>% 
  filter(date_time >= df1$date_time[1] & date_time < df1$date_time_lead[1]) %>% 
  summarise(var1_mean = mean(num_var1, na.rm = TRUE), 
            var1_sd   = sd(num_var1, na.rm = TRUE), 
            var1_sum  = sum(num_var1, na.rm = TRUE),
            var2_mean = mean(num_var2, na.rm = TRUE), 
            var2_sd   = sd(num_var2, na.rm = TRUE), 
            var2_sum  = sum(num_var2, na.rm = TRUE))

Now, ideally I'd put all of this in a custom function and then use one of the purrr functions to apply it to each row of df1 (see the sketch after step 5), but I haven't managed to figure out how, so this is the next step.

4 - Generate new columns in df1

df1$var1_mean <- NA
df1$var1_sd <- NA
df1$var1_sum <- NA
df1$var2_mean <- NA
df1$var2_sd <- NA
df1$var2_sum <- NA

5 - Add the newly generated summarised values to the ith row (in this case the first) of df1

df1[1, 4:9] <- df2_sub[1,]
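
For reference, here is one way steps 3-5 could be wrapped into a custom function and mapped over the rows of df1 with purrr. This is only a sketch: summarise_interval() is a name I've made up, and the is.na() guard (which keeps every later time point for the last row of df1, whose date_time_lead is NA) is my assumption about the intended behaviour.

summarise_interval <- function(from, to) {
  # subset df2 to the half-open interval [from, to) and summarise both variables
  df2 %>%
    filter(date_time >= from, is.na(to) | date_time < to) %>%
    summarise(var1_mean = mean(num_var1, na.rm = TRUE),
              var1_sd   = sd(num_var1, na.rm = TRUE),
              var1_sum  = sum(num_var1, na.rm = TRUE),
              var2_mean = mean(num_var2, na.rm = TRUE),
              var2_sd   = sd(num_var2, na.rm = TRUE),
              var2_sum  = sum(num_var2, na.rm = TRUE))
}

# map the function over the paired (date_time, date_time_lead) columns and
# unnest the resulting one-row data frames into new columns of df1
df1_summarised <- df1 %>%
  mutate(summaries = map2(date_time, date_time_lead, summarise_interval)) %>%
  unnest(summaries)

Note that date_time is a character column here, so >= and < compare alphabetically; that happens to work for this single-day dummy data, but with the "%d-%m-%Y" format it will misbehave across days, so parsing to POSIXct first is safer.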

Upvotes: 0

Views: 77

Answers (2)

Virginia Morera Pujol

Reputation: 477

OK, after installing the development version of dplyr so I could use the join_by() function, I still get the following error:

> df2 %>%
  left_join(df1, join_by(closest(date_time >= date_time))) %>%
  group_by(date_time = date_time.y) %>%
  summarise(across(starts_with("num_"), list(mean = mean, sd = sd, sum = sum)))

Error in `dplyr::common_by()`:
! `by` must be a (named) character vector, list, or NULL for natural joins (not recommended in
production code), not a <dplyr_join_by> object.
Run `rlang::last_error()` to see where the error occurred.

Running rlang::last_error() gives

> rlang::last_error()
<error/rlang_error>
Error in `dplyr::common_by()`:
! `by` must be a (named) character vector, list, or NULL for natural joins (not recommended in
production code), not a <dplyr_join_by> object.
---
Backtrace:
 1. ... %>% ...
 4. tidylog::left_join(., df1, join_by(closest(date_time >= date_time)))
 5. tidylog:::log_join(...)
 9. dplyr:::common_by.default(by = by, x = x, y = y)
Run `rlang::last_trace()` to see the full context.

Now, rlang::last_trace() gives

> rlang::last_trace()
<error/rlang_error>
Error in `dplyr::common_by()`:
! `by` must be a (named) character vector, list, or NULL for natural joins (not recommended in
  production code), not a <dplyr_join_by> object.
---
Backtrace:
     ▆
  1. ├─... %>% ...
  2. ├─dplyr::summarise(...)
  3. ├─dplyr::group_by(., date_time = date_time.y)
  4. └─tidylog::left_join(., df1, join_by(closest(date_time >= date_time)))
  5.   └─tidylog:::log_join(...)
  6.     ├─base::suppressMessages(dplyr::common_by(by = by, x = x, y = y))
  7.     │ └─base::withCallingHandlers(...)
  8.     ├─dplyr::common_by(by = by, x = x, y = y)
  9.     └─dplyr:::common_by.default(by = by, x = x, y = y)
 10.       └─rlang::abort(msg)

Since you managed to run it with the exact same dataset, I assume this is a version issue. I updated dplyr and rlang to the latest versions, but I might have left something behind.

> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.utf8  LC_CTYPE=English_United Kingdom.utf8   
[3] LC_MONETARY=English_United Kingdom.utf8 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] wakefield_0.3.6         raster_3.5-15           sp_1.4-7                magrittr_2.0.3         
 [5] suncalc_0.5.1           rnaturalearthdata_0.1.0 rnaturalearth_0.1.0     kableExtra_1.3.4       
 [9] amt_0.1.7               cowplot_1.1.1           tidylog_1.0.2           lubridate_1.8.0        
[13] sf_1.0-7                forcats_0.5.1           stringr_1.4.0           dplyr_1.1.0.9000       
[17] purrr_0.3.4             readr_2.1.2             tidyr_1.2.0             tibble_3.1.7           
[21] ggplot2_3.3.6           tidyverse_1.3.1        

loaded via a namespace (and not attached):
 [1] fs_1.5.2           usethis_2.1.6      devtools_2.4.3     webshot_0.5.3      httr_1.4.3        
 [6] rprojroot_2.0.3    tools_4.2.1        backports_1.4.1    utf8_1.2.2         R6_2.5.1          
[11] KernSmooth_2.23-20 DBI_1.1.2          colorspace_2.0-3   withr_2.5.0        tidyselect_1.2.0  
[16] prettyunits_1.1.1  processx_3.5.3     curl_4.3.2         compiler_4.2.1     cli_3.3.0         
[21] rvest_1.0.2        xml2_1.3.3         desc_1.4.1         scales_1.2.0       classInt_0.4-3    
[26] callr_3.7.0        proxy_0.4-26       systemfonts_1.0.4  digest_0.6.29      svglite_2.1.0     
[31] rmarkdown_2.14     htmltools_0.5.2    pkgconfig_2.0.3    sessioninfo_1.2.2  dbplyr_2.1.1      
[36] fastmap_1.1.0      rlang_1.0.6.9000   readxl_1.4.0       rstudioapi_0.13    generics_0.1.3    
[41] jsonlite_1.8.0     Matrix_1.4-1       Rcpp_1.0.8.3       munsell_0.5.0      fansi_1.0.3       
[46] lifecycle_1.0.3    terra_1.5-21       stringi_1.7.6      brio_1.1.3         pkgbuild_1.3.1    
[51] grid_4.2.1         crayon_1.5.1       lattice_0.20-45    haven_2.5.0        splines_4.2.1     
[56] hms_1.1.1          knitr_1.39         ps_1.7.0           pillar_1.8.1       codetools_0.2-18  
[61] clisymbols_1.2.0   pkgload_1.2.4      reprex_2.0.1       glue_1.6.2         evaluate_0.15     
[66] data.table_1.14.2  remotes_2.4.2      modelr_0.1.8       vctrs_0.5.2.9000   tzdb_0.3.0        
[71] Rdpack_2.3.1       testthat_3.1.4     cellranger_1.1.0   gtable_0.3.0       assertthat_0.2.1  
[76] cachem_1.0.6       xfun_0.31          rbibutils_2.2.8    broom_0.8.0        e1071_1.7-9       
[81] viridisLite_0.4.0  class_7.3-20       survival_3.3-1     memoise_2.0.1      units_0.8-0       
[86] ellipsis_0.3.2 
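
For what it's worth, the backtrace above shows the call being routed through tidylog::left_join() rather than dplyr's own verb (tidylog_1.0.2 is attached, per the sessionInfo), and tidylog's wrapper hands by to dplyr::common_by(), which does not accept join_by() objects. Assuming that diagnosis is right, a sketch of a workaround is to namespace-qualify the verb so tidylog's wrapper is bypassed:

# call dplyr::left_join() explicitly so that tidylog's logging wrapper,
# which funnels `by` through dplyr::common_by(), never sees the join_by() object
df2 %>%
  dplyr::left_join(df1, join_by(closest(date_time >= date_time))) %>%
  group_by(date_time = date_time.y) %>%
  summarise(across(starts_with("num_"), list(mean = mean, sd = sd, sum = sum)))

Detaching tidylog should have the same effect.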

Upvotes: 0

Wimpel

Reputation: 27802

Are you looking for something like this?

It is hard to tell whether this is correct, since your sample data are not minimal, and I do not follow the logical operators in the line
date_time > df1$date_time[1] & date_time < df1$date_time_lead[1]
which suggest that in the interval 0:00-0:30 you want to exclude the values at both 0:00 and 0:30?

library(data.table)
# set to data.table format
setDT(df1)
setDT(df2)
# perform rolling join, to roll down to the nearest date_from 
df2[, c("date_time_df1", "other_vars") := df1[df2, .(x.date_time, x.other_vars), on = .(date_time), roll = Inf]]
# melt to long
df2.melt <- melt(df2, measure.vars = patterns("^num_var"))
# summarise
df2.melt[, .(mean = mean(value, na.rm = TRUE),
             sd = sd(value, na.rm = TRUE),
             sum = sum(value, na.rm = TRUE)), 
         by = .(date_time = date_time_df1, other_vars, variable)]
#                  date_time other_vars variable      mean       sd       sum
#     1: 01-01-2021 00:00:00          B num_var1 183.51225 57.36997 1101.0735
#     2: 01-01-2021 00:30:00          B num_var1  96.46239 69.12586  578.7744
#     3: 01-01-2021 01:00:00          D num_var1 163.74048 84.57741  982.4429
#     4: 01-01-2021 01:30:00          A num_var1 120.74330 41.14782  724.4598
#     5: 01-01-2021 02:00:00          E num_var1 109.74861 85.56809  658.4917
# ---                                                                     
# 34942: 30-12-2021 22:00:00          E num_var2 140.67153 79.74211  844.0292
# 34943: 30-12-2021 22:30:00          B num_var2 101.25896 75.61385  607.5538
# 34944: 30-12-2021 23:00:00          E num_var2  77.86649 50.45125  467.1989
# 34945: 30-12-2021 23:30:00          D num_var2 152.96687 66.00130  917.8012
# 34946: 31-12-2021 00:00:00          E num_var2 245.89731       NA  245.8973

tidyverse solution

using dplyr >= 1.1.0, which introduced join_by() and closest()

df2 %>%
  left_join(df1, join_by(closest(date_time >= date_time))) %>%
  group_by(date_time = date_time.y) %>%
  summarise(across(starts_with("num_"), list(mean = mean, sd = sd, sum = sum)))
# A tibble: 6 × 7
#   date_time           num_var1_mean num_var1_sd num_var1_sum num_var2_mean num_var2_sd num_var2_sum
#   <chr>                       <dbl>       <dbl>        <dbl>         <dbl>       <dbl>        <dbl>
# 1 01-01-2021 00:00:00         149.         88.6         896.         116.         91.5         697.
# 2 01-01-2021 00:30:00         122.         60.8         731.          71.8        71.8         431.
# 3 01-01-2021 01:00:00         114.         86.5         681.         165.         75.9         988.
# 4 01-01-2021 01:30:00         187.         61.9        1120.         156.         81.2         936.
# 5 01-01-2021 02:00:00          87.2        86.2         523.         130.         66.9         781.
# 6 01-01-2021 02:30:00         125.         NA           125.         251.         NA           251.
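
One caveat, which is my addition rather than part of the answer above: date_time is built as a character vector in "%d-%m-%Y %H:%M:%S" format, which sorts alphabetically rather than chronologically, so the inequality join is only reliable here because the dummy data sit within a single day. Parsing to POSIXct first, e.g. with lubridate (already loaded in the question), makes the comparisons safe:

# parse the day-first character timestamps to POSIXct so that >= compares
# chronologically rather than alphabetically
df1 <- df1 %>% mutate(date_time = dmy_hms(date_time, tz = "GMT"))
df2 <- df2 %>% mutate(date_time = dmy_hms(date_time, tz = "GMT"))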

Upvotes: 1
