Recology
Recology

Reputation: 137

Grouping inside of a for-loop in R

I am using a modified quantile regression function found in a scientific paper. (source: https://arxiv.org/pdf/2111.04805.pdf)

I try to apply this particular methodology to my dataset, but I would like to perform this quantile regression based on a condition from my dataset. This means that I want to implement something similar to a group_by condition as found here:

qr_models = data %>%
  group_by(latitude) %>%
  do(model = rq(julian_day~year, tau = 1:99/100, method = "fn", data= .))

The way it is coded in the paper is the following and the loop is where I would like to add the group_by:

for (i in 1:n) {
  alpha = i/(n+1)
  alphas[i] = alpha
  betas <- smrq(X,y,tau=alpha) 
  vals0[i] <- sum(y<(X %*% betas))
}

where n is n <- 99 (the quantile resolution chosen); vals0 <- rep(0,n) and alphas <- rep(0,n). I tend to avoid using loops in R so I am a bit lost on how to do it.

Just in case it is needed to understand, the smrq() function is the one that is described in the paper mentioned previously, and is coded like this:

smrq <- function(X, y, tau){
  p = ncol(X)
  op.result <- optim(rep(0, p),
                     fn = minimize.logcosh,
                     method = 'BFGS',
                     X = X,
                     y = y,
                     tau = tau)
  beta <- op.result$par
  return (beta)
}

where X is the explanatory variable matrix and y the response variable.

I hope it is clear enough, if more details are needed I will be happy to update my post. Many thanks for the precious help.

Upvotes: 0

Views: 73

Answers (1)

Parfait
Parfait

Reputation: 107757

Consider encapsulating the author's entire processing in a user-defined method that receives data (swiss from authors) as input parameter along with other variables including formula (Fertility ~ . from authors) and response column ("Fertility" from authors).

Then, pass data subsets with group_by. Also, the authors for loop can be refactored to a vectorized loop such as sapply or vapply since the return is a numeric vector.

Generalized Functions

minimize.logcosh <- function(par, X, y, tau) {
    diff <- y-(X %*% par)
    check <- (tau-0.5)*diff+(0.5/0.7)*logcosh(0.7*diff)+0.4
    return(sum(check))
}

smrq <- function(X, y, tau){
    p <- ncol(X)
    op.result <- optim(
        rep(0, p),
        fn = minimize.logcosh,
        method = 'BFGS',
        X = X,
        y = y,
        tau = tau
    )
    beta <- op.result$par
    return(beta)
}

run_smrq <- function(data, fml, response) {
    x <- model.matrix(fml, data)[,-1]
    y <- data[[response]]
    X <- cbind(x, rep(1,nrow(x)))

    n <- 99
    betas <- sapply(1:n, function(i) smrq(X, y, tau=i/(n+1)))
    # betas <- vapply(1:n, function(i) smrq(X, y, tau=i/(n+1)), numeric(1))
    return(betas)        
}

Callers

Test authors' example

swiss <- datasets::swiss
smrq_models <- run_smrq(data=swiss, fml=Fertility~., response="Fertility")

dplyr (using group_map)

smrq_models <- data %>%
  group_by(latitude) %>%
  group_map(~ run_smrq(data=., fml=julian_day~year, response="julian_day")

base (using by, the object-oriented wrapper to tapply)

smrq_models <- by(
   data, 
   data$latitude, 
   function(sub) run_smrq(data=sub, fml=julian_day~year, response="julian_day")
)

Upvotes: 2

Related Questions