gir emee

Reputation: 71

Adding descriptive statistics to a function when making use of the ellipsis as input variable

For an assignment I have created a function in R that calculates the regression coefficients, predicted values and residuals for a multiple linear regression. I did that as follows:

MLR <- function(y_var, ...){  
  
  y <- y_var  
  X <- as.matrix(cbind(...))  
  
  intercept <- rep(1, length(y)) 
  
  X <- cbind(intercept, X) 
  
  regression_coef <- solve(t(X) %*% X) %*% t(X) %*% y  
  
  predicted_val <- X %*% regression_coef 
  
  residual_val <- y - predicted_val 
 
  
  scatterplot <- plot(predicted_val, residual_val,
                      ylab = 'Residuals', xlab = 'Predicted values',
                      main = 'Predicted values against the residuals',
                      abline(0,0))
 
  list('y' = y, 
       'X' = X, 
       'Regression coefficients' = regression_coef,
       'Predicted values' = predicted_val, 
       'Residuals' = residual_val,
       'Scatterplot' = scatterplot
       )
}

Now, my struggle is to add descriptive statistics of my input variables. Since I want the function to accept any number of independent variables, I used the ellipsis as an input argument. Is there a way to calculate useful descriptive statistics (mean, variance, standard deviation) of my independent variables (the ones passed through ...)?

This

mean(...)

does not work...

Thank you for the replies already!

Upvotes: 1

Views: 51

Answers (2)

M_Kos

Reputation: 78

I think I've got it. Unfortunately, the ellipsis is quite quirky to work with. Check whether cbind(...) works correctly inside your function: when I checked it in the output, X was only 1 column wide although I had passed 2 variables, and that doesn't seem right.
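
One way to check this in isolation is to wrap only the cbind(...) step in a tiny function and look at the dimensions it returns. This is a minimal sketch; check_dots is a hypothetical helper name, not part of the original function.

```r
# Hypothetical helper: report the dimensions of cbind(...) inside a function
check_dots <- function(...) {
  X <- as.matrix(cbind(...))
  dim(X)
}

# Two vectors of length 150 should give a matrix with 150 rows and 2 columns
dims <- check_dots(iris$Sepal.Width, iris$Petal.Length)
dims
```

If the second number is smaller than the number of vectors passed in, the problem is in how the function is called rather than in cbind(...) itself.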

My solution doesn't read the variable names; it uses placeholder names (Var_1, Var_2, ..., Var_n).


MLR <- function(y_var, ...){  
  
  # these two packages will come in handy
  
  require(dplyr)
  require(tidyr)
  
  y <- y_var  
  X <- as.matrix(cbind(...))
  
  # firstly, we need to make df/tibble out of ellipsis
  
  X2 <- list(...)
  
  n <- tibble(n = rep(0, times = length(y)))
  
  index <- 0
  
  for(Var in X2){
    
    index <- index + 1
    n[, paste0("Var_", index)] <- Var
    
  }
  
  # after the df was created, now it's time for calculating desc
  # Using tidyr::gather with dplyr::summarize creates nice summary, 
  # where each row is another variable
  
  descriptives <- tidyr::gather(n, key = "Variable", value = "Value") %>%
    group_by(Variable) %>%
    summarize(mean = mean(Value), var = var(Value), sd = sd(Value), .groups = "keep")
  
  # everything except the output list is the same
  
  intercept <- rep(1, length(y)) 
  
  X <- cbind(intercept, X) 
  
  regression_coef <- solve(t(X) %*% X) %*% t(X) %*% y  
  
  predicted_val <- X %*% regression_coef 
  
  residual_val <- y - predicted_val 
  
  
  scatterplot <- plot(predicted_val, residual_val,
                      ylab = 'Residuals', xlab = 'Predicted values',
                      main = 'Predicted values against the residuals',
                      abline(0,0))
  
  
  list('y' = y, 
       'X' = X, 
       'Regression coefficients' = regression_coef,
       'Predicted values' = predicted_val, 
       'Residuals' = residual_val,
       'Scatterplot' = scatterplot,
       'descriptives' = descriptives[descriptives$Variable != "n", ] # drop the row for the "n" placeholder
                                                                     # by name, not by position
  )
}
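
If the placeholder names are a problem, the expressions passed through the ellipsis can be recovered with substitute() and used as labels. The following is a rough sketch, not a drop-in replacement: describe_dots is a hypothetical name, and it assumes every argument in ... is a single vector (not a matrix or data frame).

```r
# Hypothetical helper: descriptive statistics labelled with the argument expressions
describe_dots <- function(...) {
  # Unevaluated argument expressions, e.g. iris$Sepal.Width
  exprs <- as.list(substitute(list(...)))[-1]  # drop the leading `list` symbol
  labels <- unname(vapply(exprs,
                          function(e) paste(deparse(e), collapse = ""),
                          character(1)))

  # Assumes one column per argument, so labels and columns line up
  X <- as.matrix(cbind(...))
  colnames(X) <- labels

  data.frame(
    Variable = labels,
    mean = colMeans(X),
    var = apply(X, 2, var),
    sd = apply(X, 2, sd),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

desc <- describe_dots(iris$Sepal.Width, iris$Petal.Length)
desc
```

The same labels could be used in place of paste0("Var_", index) in the loop above.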

Upvotes: 0

Duck

Reputation: 39605

Try these slight changes to your function. I have applied it to some variables of the iris dataset. You can compute the desired statistics over X and then return them as an additional slot in your output. Here is the code:

#Function
MLR <- function(y_var, ...){  
  
  y <- y_var
  X <- as.matrix(cbind(...))  
  RX <- X
  
  intercept <- rep(1, length(y)) 
  
  X <- cbind(intercept, X) 
  
  regression_coef <- solve(t(X) %*% X) %*% t(X) %*% y  
  
  predicted_val <- X %*% regression_coef 
  
  residual_val <- y - predicted_val 
  
  
  scatterplot <- plot(predicted_val, residual_val,
                      ylab = 'Residuals', xlab = 'Predicted values',
                      main = 'Predicted values against the residuals',
                      abline(0,0))
  
  #Summary
  #Stats
  DMeans <- apply(RX, 2, mean, na.rm = TRUE)
  DSD <- apply(RX, 2, sd, na.rm = TRUE)
  DVar <- apply(RX, 2, var, na.rm = TRUE)
  DSummary <- rbind(DMeans,DSD,DVar)
  #Out
  list('y' = y, 
       'X' = X, 
       'Regression coefficients' = regression_coef,
       'Predicted values' = predicted_val, 
       'Residuals' = residual_val,
       'Scatterplot' = scatterplot,
       'Summary' = DSummary
  )
}
#Apply
MLR(y_var = iris$Sepal.Length,iris$Sepal.Width,iris$Petal.Length)

The final slots of the output will look like this:

$Scatterplot
NULL

$Summary
            [,1]     [,2]
DMeans 3.0573333 3.758000
DSD    0.4358663 1.765298
DVar   0.1899794 3.116278
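
As for why mean(...) does not work: mean() takes a single vector as its first argument, so any further vectors in ... are matched to its other parameters (trim, na.rm) instead of being averaged. A possible alternative to the apply() calls, sketched here with a hypothetical helper name dots_summary, is to collect the ellipsis into a list and apply each statistic to every element:

```r
# Hypothetical helper: one column per argument passed through ...
dots_summary <- function(...) {
  vars <- list(...)  # each element is one of the input vectors
  rbind(
    mean = vapply(vars, mean, numeric(1), na.rm = TRUE),
    sd   = vapply(vars, sd,   numeric(1), na.rm = TRUE),
    var  = vapply(vars, var,  numeric(1), na.rm = TRUE)
  )
}

res <- dots_summary(iris$Sepal.Width, iris$Petal.Length)
round(res, 4)
```

Unlike cbind(...), this version does not require the inputs to have the same length, which may or may not be desirable for a regression function.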

Upvotes: 1
