user11057680

Optimizing dplyr summarise model run

I'm looking to optimize a summarise() call in dplyr that runs a model over several groups of observations. The end goal is to 1.) fit a model for all relevant observations in each group, and 2.) estimate the response value when the predictor variable equals 0. There may be a solution where the full model object doesn't need to be stored in the summarised result. I have also converted the data frame to a data.table to improve performance. See the code below, where df is a generic data frame; since this is a question about process, a reprex should not be required. For reference, running a gam instead of a scam takes a fraction of the time.
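For anyone who wants to reproduce the timings locally, here is a minimal synthetic stand-in for df. The column names (X, Y for grouping, A for the predictor, B for the response) are assumptions taken from the code below, not the real data:

```r
# Hypothetical stand-in for the real df: grouping columns X/Y,
# predictor A, and a monotone-increasing response B with noise
set.seed(1)
df <- expand.grid(X = 1:10, Y = 1:10, A = seq(0, 1, length.out = 50))
df$B <- with(df, plogis(5 * (A - 0.5)) + rnorm(nrow(df), sd = 0.05))
```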

library(dplyr)
library(data.table)
library(dtplyr)
library(scam)

dt <- data.table(df)

ptm <- proc.time()
new_dat <- dt %>% 
   summarise(scam_model = list(scam(B ~ s(A, bs = "mpi"))), .by = c(X, Y))
proc.time() - ptm

This portion is less likely to cause an issue:

pred_df <- data.frame(A = 0)
scam_predict <- function(m) predict(m, pred_df)
new_dat$scam_pred_0 <- sapply(new_dat$scam_model, scam_predict)
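As noted above, the full model object may not need to be stored at all. Since only the prediction at A = 0 is wanted, one possible sketch (same dplyr verbs as above, untested against the real data) fits and predicts in a single summarise() step, so each scam object can be garbage-collected after its group is processed instead of being kept in a list-column:

```r
library(dplyr)
library(scam)

pred_df <- data.frame(A = 0)

ptm <- proc.time()
new_dat <- dt %>% 
   summarise(
      # fit and predict in one expression; the model object is discarded
      scam_pred_0 = predict(scam(B ~ s(A, bs = "mpi")), pred_df),
      .by = c(X, Y)
   )
proc.time() - ptm
```

This mainly saves memory; the per-group scam() fits themselves are still likely to dominate the runtime.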

Update:

I updated the code to use fgroup_by()/fsummarise() from the collapse package (see below) and received an error suggesting model-fitting or parameter-optimization issues:

Error in bfgs_gcv.ubre(gcv.ubre_grad, rho = rho, G = G, env = env, control = control) : object 'old.alpha' not found

library(collapse)

ptm <- proc.time()
new_dat <- dt %>% 
   fgroup_by(X, Y) %>%
   fsummarise(scam_model = list(scam(B ~ s(A, bs = "mpi"))))
proc.time() - ptm
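Since dt is already a data.table, one more avenue worth trying (a sketch, not benchmarked here) is data.table's native by grouping, which skips the dtplyr/collapse translation layer entirely:

```r
library(data.table)
library(scam)

ptm <- proc.time()
# .() is data.table shorthand for list(); one scam fit per X/Y group
new_dat <- dt[, .(scam_model = list(scam(B ~ s(A, bs = "mpi")))), by = .(X, Y)]
proc.time() - ptm
```

Whether this helps depends on how much of the total time is grouping overhead versus the scam() fits themselves.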
