Cameron

Reputation: 320

predFit() generates an unreasonably sized prediction file when using lapply() in R?

Following the example structure given by user tpetzdoldt in the answer to "Using predictNLS to create confidence intervals around fitted values in R?", I ran:

library(tidyverse)
library(investr)

data <- tibble(date = 1:7,
               cases = c(0, 0, 1, 4, 7, 8.5, 8.5))

model <- nls(cases ~ SSlogis(log(date), Asym, xmid, scal), data = data)
new.data <- data.frame(date = seq(1, 10, by = 0.1))
interval <- as_tibble(predFit(model, newdata = new.data, interval = "confidence", level = 0.9)) %>% 
  mutate(date = new.data$date)

I then attempted to apply these same concepts to my own data (reproducible version generated here):

#Trying to create a reproducible example:

string_temp <- c(5, 12, 43, 12, 0.5, 11, 16, 15, 10, 8)
string_resp <- c(22, 15, 106, 18, 9, 14, 32, 11, 1, 4)
string_id <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", 
               "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V")

temp <- rep(string_temp, 220)
resp <- rep(string_resp, 220)
id <- rep(string_id, 100)

data_model <- data.frame(temp, resp, id)

#Data for predictions:

predictions <- data.frame(temp = runif(122735))

#Split by identity:

data_model_split <- data_model %>% split(data_model$id)

#Model (one nls fit per id):

model <- lapply(data_model_split, function(d) nls(resp ~ a * exp(b * temp), 
                                                  start = list(a = 0.8, b = 0.1), 
                                                  data = d))

#Results (first two models only):

results <- lapply(1:2, function(i) { 
  predFit(model[[i]], newdata = predictions, interval = "confidence", level = 0.9)
})

I get the following error:

 Error: cannot allocate vector of size 112.2 Gb 

It seems strange that these adjustments would trigger an allocation of that size. The data frame produced in the example above was only 4 columns wide. I am feeding each of the 22 models in "model" about 122,000 rows of new data, but even the hypothetical 4-column x ~3,000,000-row data frame that would produce should be nowhere near 1 Gb, let alone 112 Gb. Is something going wrong with my application of lapply() here? I apologize for the lack of reproducibility in my original example, as the dataset is very large, but I hope the issue lies in my code rather than my data. If helpful, I can try to generate a reproducible proxy for my data.
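For reference, here is my rough arithmetic for the output I was expecting (my own estimate: 22 models x 122,735 prediction rows, 4 numeric columns):

#4 doubles (8 bytes each) per row, ~2.7 million rows in total:
22 * 122735 * 4 * 8 / 2^30
#> 0.0805 -- i.e. about 0.08 Gb, nothing close to 112.2 Gb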

Upvotes: 2

Views: 179

Answers (1)

runr

Reputation: 1146

The error is caused by the following line in the predFit code (specifically investr's predFit.nls method):

v0 <- diag(f0 %*% tcrossprod(solve(crossprod(R1)), f0))

which is trying to construct a 122735 x 122735 matrix (given your example) and then take its diagonal. A dense matrix of that size takes an enormous amount of memory in base R; in fact the arithmetic matches your error message exactly:
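#A dense 122735 x 122735 matrix of doubles, 8 bytes each:
122735^2 * 8 / 2^30
#> 112.2347 -- the "112.2 Gb" from the error message

However, note that the expression above is equivalent to: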

library(magrittr)

v0 <- lapply(1:nrow(f0), function(rw){
  f0[rw, , drop = FALSE] %*% tcrossprod(solve(crossprod(R1)), f0[rw, , drop = FALSE])
}) %>% do.call(c, .)

That is, we never need the whole matrix in memory at once; we can compute each diagonal element one row at a time.
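As a quick sanity check (a toy example of mine, not from investr), the row-by-row version reproduces diag() of the full product on small random matrices:

set.seed(1)
f0 <- matrix(rnorm(20), nrow = 10)  # stand-in gradient matrix: 10 rows, 2 parameters
R1 <- matrix(rnorm(4), nrow = 2)    # stand-in for the model's R matrix

full <- diag(f0 %*% tcrossprod(solve(crossprod(R1)), f0))
by_row <- lapply(1:nrow(f0), function(rw){
  f0[rw, , drop = FALSE] %*% tcrossprod(solve(crossprod(R1)), f0[rw, , drop = FALSE])
}) %>% do.call(c, .)

all.equal(full, by_row)
#> TRUE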

Notes:

  1. I'm sure there's an easier/quicker way to achieve what the original code is trying to do; one vectorized possibility is sketched after these notes. There may also be alternative libraries that handle this task more efficiently, but those are not the focus of this answer.

  2. If you're set on using predFit specifically, it's possible to correct the code and override the original function.
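On note 1, one standard linear-algebra shortcut (my suggestion, not something taken from investr) avoids the huge intermediate entirely: the i-th diagonal element of f0 %*% M %*% t(f0) is the dot product of the i-th row of f0 %*% M with the i-th row of f0, so

#Same result as diag(f0 %*% tcrossprod(solve(crossprod(R1)), f0)),
#but never allocates anything larger than nrow(f0) x ncol(f0):
v0 <- rowSums((f0 %*% solve(crossprod(R1))) * f0)

This also stays fully vectorized, so it should be considerably faster than the lapply() loop above.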

Regarding the override in note 2: I'm no expert here, and there is surely a cleaner/more elegant way to do this, but one option is to extract the original code and fix the issue at hand:

predFit.nls_custom <- function (object, newdata, se.fit = FALSE,
                                interval = c("none", "confidence", "prediction"),
                                level = 0.95,
                                adjust = c("none", "Bonferroni", "Scheffe"), k, ...) 
{
  require(magrittr)
  # Match arguments first (the original matched interval further down, which
  # errors in recent R versions when the default length-3 vector reaches ||)
  interval <- match.arg(interval)
  adjust <- match.arg(adjust)
  compute.se.fit <- se.fit || (interval != "none")
  if (object$call$algorithm == "plinear") {
    stop("The Golub-Pereyra algorithm for partially linear least-squares models is currently not supported.", 
         call. = FALSE)
  }
  newdata <- if (missing(newdata)) {
    eval(getCall(object)$data, envir = parent.frame())
  }
  else {
    as.data.frame(newdata)
  }
  if (is.null(newdata)) {
    stop("No data available for predictions.", call. = FALSE)
  }
  xname <- intersect(all.vars(formula(object)[[3]]), colnames(newdata))
  pred <- object$m$predict(newdata)
  if (compute.se.fit) {
    param.names <- names(coef(object))
    for (i in 1:length(param.names)) {
      assign(param.names[i], coef(object)[i])
    }
    assign(xname, newdata[, xname])
    form <- object$m$formula()
    rhs <- eval(form[[3]])
    if (is.null(attr(rhs, "gradient"))) {
      f0 <- attr(numericDeriv(form[[3]], param.names), 
                 "gradient")
    }
    else {
      f0 <- attr(rhs, "gradient")
    }
    R1 <- object$m$Rmat()

    # Applied fix below: 
    v0 <- lapply(1:nrow(f0), function(rw){
      f0[rw, , drop = FALSE] %*% tcrossprod(solve(crossprod(R1)), f0[rw, , drop = FALSE])
    }) %>% do.call(c, .) 
    # --- End of fix
    
    se_fit <- sqrt(Sigma(object)^2 * v0)
  }
  if (interval == "none") {
    res <- pred
  }
  else {
    crit <- if (adjust == "Bonferroni") {
      qt((level + 2 * k - 1)/(2 * k), df.residual(object))
    }
    else if (adjust == "Scheffe") {
      if (interval == "confidence") {
        p <- length(coef(object))
        sqrt(p * qf(level, p, df.residual(object)))
      }
      else {
        sqrt(k * qf(level, k, df.residual(object)))
      }
    }
    else {
      qt((level + 1)/2, df.residual(object))
    }
    if (interval == "confidence") {
      lwr <- pred - crit * se_fit
      upr <- pred + crit * se_fit
    }
    else {
      lwr <- pred - crit * sqrt(Sigma(object)^2 + se_fit^2)
      upr <- pred + crit * sqrt(Sigma(object)^2 + se_fit^2)
    }
    res <- cbind(fit = pred, lwr = lwr, upr = upr)
  }
  if (se.fit) {
    res <- list(fit = res, se.fit = se_fit, df = df.residual(object), 
                residual.scale = Sigma(object))
  }
  return(res)
}

Next, one way is to include the following code, which overrides the predFit.nls method with our custom variant, predFit.nls_custom (see here for other ways to override). The two Sigma lines copy investr's unexported Sigma() helper into the global environment so the custom function can find it:

assignInNamespace("predFit.nls", predFit.nls_custom, ns = "investr")
Sigma <- investr:::Sigma
Sigma.nls <- investr:::Sigma.nls

And re-running the original code:

results <- lapply(1:2, function(i) { 
  predFit(model[[i]], newdata = predictions, interval = "confidence", level = 0.9)
})

which should now run without the allocation error. If it does not, the override was likely not applied correctly.
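As an optional sanity check (my own suggestion; orig_method is a hypothetical handle you create yourself), you can grab the shipped method before running assignInNamespace() and confirm both versions agree on a newdata small enough for the original to handle:

#Run this BEFORE the assignInNamespace() call above:
orig_method <- investr:::predFit.nls

small_pred <- data.frame(temp = runif(50))
a <- orig_method(model[[1]], newdata = small_pred, interval = "confidence", level = 0.9)
b <- predFit(model[[1]], newdata = small_pred, interval = "confidence", level = 0.9)
all.equal(a, b)
#> TRUE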

Upvotes: 1
