Reputation: 477
The problem: I have two datasets that are essentially time sequences, with something measured at each timepoint. Df1 has a sample point every 30 min, while df2 has a sample point every 5 min. I want to join the two datasets, but to do that I need to summarise df2 so it also has only one value per 30 min. However, these intervals are not always constant, so this is the solution I've come up with; I then need to be able to apply it to every row of df1. I think there are purrr functions that will let me map a custom function row-wise, but I haven't quite figured out how.
This is what I've done so far:
0 - Packages
library(lubridate)
library(tidyverse)
library(wakefield)
library(magrittr)
1 - Generate dummy datasets for example
df1 <- data.frame(date_time = format(seq(as.POSIXct("2021-01-01 00:00:00", tz = "GMT"),
                                         as.POSIXct("2021-01-01 02:30:00", tz = "GMT"),
                                         by = "30 min"), "%d-%m-%Y %H:%M:%S"),
                  other_vars = upper(6, k = 5, x = LETTERS, prob = NULL, name = "Upper"))
df2 <- data.frame(date_time = format(seq(as.POSIXct("2021-01-01 00:00:00", tz = "GMT"),
                                         as.POSIXct("2021-01-01 02:30:00", tz = "GMT"),
                                         by = "5 min"), "%d-%m-%Y %H:%M:%S"),
                  num_var1 = runif(31, 0, 255),
                  num_var2 = runif(31, 0, 255))
2 - Add time at row i+1 as new column in df1
df1 %<>%
mutate(date_time_lead = lead(date_time))
3 - Subset rows of df2 with time points between time i and i+1 (in this case i == 1) of df1 and summarise the values (mean, sd, sum)
df2_sub <- df2 %>%
filter(date_time >= df1$date_time[1] & date_time < df1$date_time_lead[1]) %>%
summarise(var1_mean = mean(num_var1, na.rm = T),
var1_sd = sd(num_var1, na.rm = T),
var1_sum = sum(num_var1, na.rm = T),
var2_mean = mean(num_var2, na.rm = T),
var2_sd = sd(num_var2, na.rm = T),
var2_sum = sum(num_var2, na.rm = T))
Now, ideally I'd put all of this in a custom function and then use one of the purrr functions to apply it to each row of df1, but I haven't managed to figure out how, so this is the next step
4 - Generate new columns in df1
df1$var1_mean <- NA
df1$var1_sd <- NA
df1$var1_sum <- NA
df1$var2_mean <- NA
df1$var2_sd <- NA
df1$var2_sum <- NA
5 - Add the newly generated summarised values to the ith row (in this case the first) of df1
df1[1, 4:9] <- df2_sub[1,]
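Steps 3-5 above can be wrapped in a single function and mapped over the rows of df1 with purrr, avoiding the manual copy per row. A minimal sketch, assuming df1 and df2 as built above (with date_time_lead already added); the helper name summarise_interval is my own:

```r
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical helper: summarise the rows of `data` whose date_time
# falls in the half-open interval [start, end)
summarise_interval <- function(start, end, data) {
  data %>%
    filter(date_time >= start, date_time < end) %>%
    summarise(across(starts_with("num_"),
                     list(mean = ~ mean(.x, na.rm = TRUE),
                          sd   = ~ sd(.x, na.rm = TRUE),
                          sum  = ~ sum(.x, na.rm = TRUE))))
}

# map2() walks the interval start/end columns in parallel; each call
# returns a one-row summary, which unnest() binds back onto df1
df1_summary <- df1 %>%
  mutate(summaries = map2(date_time, date_time_lead,
                          summarise_interval, data = df2)) %>%
  unnest(summaries)
```

Note that the last row of df1 has date_time_lead == NA, so its interval matches nothing and its summary columns come back as NaN/NA/0; you may want to drop that row or supply an explicit end point.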
Upvotes: 0
Views: 77
Reputation: 477
OK, after installing the development version of dplyr so I could use the join_by() function, I still get the following error:
> df2 %>%
left_join(df1, join_by(closest(date_time >= date_time))) %>%
group_by(date_time = date_time.y) %>%
summarise(across(starts_with("num_"), list(mean = mean, sd = sd, sum = sum)))
Error in `dplyr::common_by()`:
! `by` must be a (named) character vector, list, or NULL for natural joins (not recommended in
production code), not a <dplyr_join_by> object.
Run `rlang::last_error()` to see where the error occurred.
Running rlang::last_error() gives
> rlang::last_error()
<error/rlang_error>
Error in `dplyr::common_by()`:
! `by` must be a (named) character vector, list, or NULL for natural joins (not recommended in
production code), not a <dplyr_join_by> object.
---
Backtrace:
1. ... %>% ...
4. tidylog::left_join(., df1, join_by(closest(date_time >= date_time)))
5. tidylog:::log_join(...)
9. dplyr:::common_by.default(by = by, x = x, y = y)
Run `rlang::last_trace()` to see the full context.
Now, rlang::last_trace() gives
> rlang::last_trace()
<error/rlang_error>
Error in `dplyr::common_by()`:
! `by` must be a (named) character vector, list, or NULL for natural joins (not recommended in
production code), not a <dplyr_join_by> object.
---
Backtrace:
▆
1. ├─... %>% ...
2. ├─dplyr::summarise(...)
3. ├─dplyr::group_by(., date_time = date_time.y)
4. └─tidylog::left_join(., df1, join_by(closest(date_time >= date_time)))
5. └─tidylog:::log_join(...)
6. ├─base::suppressMessages(dplyr::common_by(by = by, x = x, y = y))
7. │ └─base::withCallingHandlers(...)
8. ├─dplyr::common_by(by = by, x = x, y = y)
9. └─dplyr:::common_by.default(by = by, x = x, y = y)
10. └─rlang::abort(msg)
Since you managed to run it with the exact same dataset, I assume this is a version issue. I updated dplyr and rlang to the latest versions, but I might have left something behind.
> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.utf8 LC_CTYPE=English_United Kingdom.utf8
[3] LC_MONETARY=English_United Kingdom.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.utf8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] wakefield_0.3.6 raster_3.5-15 sp_1.4-7 magrittr_2.0.3
[5] suncalc_0.5.1 rnaturalearthdata_0.1.0 rnaturalearth_0.1.0 kableExtra_1.3.4
[9] amt_0.1.7 cowplot_1.1.1 tidylog_1.0.2 lubridate_1.8.0
[13] sf_1.0-7 forcats_0.5.1 stringr_1.4.0 dplyr_1.1.0.9000
[17] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.7
[21] ggplot2_3.3.6 tidyverse_1.3.1
loaded via a namespace (and not attached):
[1] fs_1.5.2 usethis_2.1.6 devtools_2.4.3 webshot_0.5.3 httr_1.4.3
[6] rprojroot_2.0.3 tools_4.2.1 backports_1.4.1 utf8_1.2.2 R6_2.5.1
[11] KernSmooth_2.23-20 DBI_1.1.2 colorspace_2.0-3 withr_2.5.0 tidyselect_1.2.0
[16] prettyunits_1.1.1 processx_3.5.3 curl_4.3.2 compiler_4.2.1 cli_3.3.0
[21] rvest_1.0.2 xml2_1.3.3 desc_1.4.1 scales_1.2.0 classInt_0.4-3
[26] callr_3.7.0 proxy_0.4-26 systemfonts_1.0.4 digest_0.6.29 svglite_2.1.0
[31] rmarkdown_2.14 htmltools_0.5.2 pkgconfig_2.0.3 sessioninfo_1.2.2 dbplyr_2.1.1
[36] fastmap_1.1.0 rlang_1.0.6.9000 readxl_1.4.0 rstudioapi_0.13 generics_0.1.3
[41] jsonlite_1.8.0 Matrix_1.4-1 Rcpp_1.0.8.3 munsell_0.5.0 fansi_1.0.3
[46] lifecycle_1.0.3 terra_1.5-21 stringi_1.7.6 brio_1.1.3 pkgbuild_1.3.1
[51] grid_4.2.1 crayon_1.5.1 lattice_0.20-45 haven_2.5.0 splines_4.2.1
[56] hms_1.1.1 knitr_1.39 ps_1.7.0 pillar_1.8.1 codetools_0.2-18
[61] clisymbols_1.2.0 pkgload_1.2.4 reprex_2.0.1 glue_1.6.2 evaluate_0.15
[66] data.table_1.14.2 remotes_2.4.2 modelr_0.1.8 vctrs_0.5.2.9000 tzdb_0.3.0
[71] Rdpack_2.3.1 testthat_3.1.4 cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1
[76] cachem_1.0.6 xfun_0.31 rbibutils_2.2.8 broom_0.8.0 e1071_1.7-9
[81] viridisLite_0.4.0 class_7.3-20 survival_3.3-1 memoise_2.0.1 units_0.8-0
[86] ellipsis_0.3.2
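For what it's worth, the backtrace above dispatches to tidylog::left_join (frames 4-5), not dplyr::left_join: the tidylog wrapper routes `by` through dplyr::common_by(), which does not understand a join_by() object. If that masking is indeed the cause, a sketch of a workaround is to namespace the verb explicitly so the wrapper is skipped (untested against this exact session):

```r
# call dplyr's join directly instead of the tidylog::left_join wrapper
df2 %>%
  dplyr::left_join(df1, join_by(closest(date_time >= date_time))) %>%
  group_by(date_time = date_time.y) %>%
  summarise(across(starts_with("num_"), list(mean = mean, sd = sd, sum = sum)))
```

If that clears the error, the mismatch is between tidylog and the development dplyr rather than dplyr itself, and detaching tidylog for this pipeline is another option.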
Upvotes: 0
Reputation: 27802
Are you looking for something like this?
It is hard to tell whether it is correct, since your sample data is not minimal, and I do not follow the use of the logical operators in the line
date_time >= df1$date_time[1] & date_time < df1$date_time_lead[1]
which suggests that in the interval 0:00-0:30 you want to exclude the value at 0:30?
library(data.table)
# set to data.table format
setDT(df1)
setDT(df2)
# perform rolling join, to roll down to the nearest date_from
df2[, c("date_time_df1", "other_vars") := df1[df2, .(x.date_time, x.other_vars), on = .(date_time), roll = Inf]]
# melt to long
df2.melt <- melt(df2, measure.vars = patterns("^num_var"))
# summarise
df2.melt[, .(mean = mean(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE),
sum = sum(value, na.rm = TRUE)),
by = .(date_time = date_time_df1, other_vars, variable)]
# date_time other_vars variable mean sd sum
# 1: 01-01-2021 00:00:00 B num_var1 183.51225 57.36997 1101.0735
# 2: 01-01-2021 00:30:00 B num_var1 96.46239 69.12586 578.7744
# 3: 01-01-2021 01:00:00 D num_var1 163.74048 84.57741 982.4429
# 4: 01-01-2021 01:30:00 A num_var1 120.74330 41.14782 724.4598
# 5: 01-01-2021 02:00:00 E num_var1 109.74861 85.56809 658.4917
# ---
# 34942: 30-12-2021 22:00:00 E num_var2 140.67153 79.74211 844.0292
# 34943: 30-12-2021 22:30:00 B num_var2 101.25896 75.61385 607.5538
# 34944: 30-12-2021 23:00:00 E num_var2 77.86649 50.45125 467.1989
# 34945: 30-12-2021 23:30:00 D num_var2 152.96687 66.00130 917.8012
# 34946: 31-12-2021 00:00:00 E num_var2 245.89731 NA 245.8973
tidyverse solution (using dplyr >= v1.1.0)
df2 %>%
left_join(df1, join_by(closest(date_time >= date_time))) %>%
group_by(date_time = date_time.y) %>%
summarise(across(starts_with("num_"), list(mean = mean, sd = sd, sum = sum)))
# A tibble: 6 × 7
# date_time num_var1_mean num_var1_sd num_var1_sum num_var2_mean num_var2_sd num_var2_sum
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 01-01-2021 00:00:00 149. 88.6 896. 116. 91.5 697.
# 2 01-01-2021 00:30:00 122. 60.8 731. 71.8 71.8 431.
# 3 01-01-2021 01:00:00 114. 86.5 681. 165. 75.9 988.
# 4 01-01-2021 01:30:00 187. 61.9 1120. 156. 81.2 936.
# 5 01-01-2021 02:00:00 87.2 86.2 523. 130. 66.9 781.
# 6 01-01-2021 02:30:00 125. NA 125. 251. NA 251.
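One caveat with both solutions: date_time was built with format(), so it is a character column, and the comparisons inside the joins are string comparisons. With day-first "%d-%m-%Y %H:%M:%S" strings this happens to order correctly within a single day but breaks across days (e.g. "02-01-2021 ..." sorts before "31-12-2020 ..."). A small sketch, assuming lubridate, that parses both columns to POSIXct before joining:

```r
library(dplyr)
library(lubridate)

# dmy_hms() parses the day-first "%d-%m-%Y %H:%M:%S" strings into POSIXct,
# so the join conditions compare actual instants rather than characters
df1 <- df1 %>% mutate(date_time = dmy_hms(date_time, tz = "GMT"))
df2 <- df2 %>% mutate(date_time = dmy_hms(date_time, tz = "GMT"))
```

The joins above then work unchanged, and the output date_time column is a real date-time instead of a string.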
Upvotes: 1