Reputation: 71
For an assignment I have created a function in R that calculates the regression coefficients, predicted values and residuals for a multiple linear regression. I did that as follows:
MLR <- function(y_var, ...){
y <- y_var
X <- as.matrix(cbind(...))
intercept <- rep(1, length(y))
X <- cbind(intercept, X)
regression_coef <- solve(t(X) %*% X) %*% t(X) %*% y
predicted_val <- X %*% regression_coef
residual_val <- y - predicted_val
scatterplot <- plot(predicted_val, residual_val,
ylab = 'Residuals', xlab = 'Predicted values',
main = 'Predicted values against the residuals')
# draw the zero line in its own call instead of passing it to plot()
abline(0,0)
list('y' = y,
'X' = X,
'Regression coefficients' = regression_coef,
'Predicted values' = predicted_val,
'Residuals' = residual_val,
'Scatterplot' = scatterplot
)
}
Now, my struggle is to add descriptive statistics of my input variables. Since I want to allow any number of independent variables, I used the ellipsis (...) as the input argument. Is there a way to calculate useful descriptive statistics (mean, variance, standard deviation) for each of my independent variables (the ones passed through the ...)?
This
mean(...)
does not work...
Thank you for the replies already!
Upvotes: 1
Views: 51
Reputation: 78
I think I've got it. Unfortunately, the ellipsis is quite quirky to work with. Check whether cbind(...) works correctly inside your function: when I checked the output, it was only 1 column wide even though I had passed 2 variables, and that doesn't seem right.
My solution doesn't read the variable names; it uses placeholder names (Var_1, Var_2, ..., Var_n).
MLR <- function(y_var, ...){
# these two packages will come in handy
require(dplyr)
require(tidyr)
y <- y_var
X <- as.matrix(cbind(...))
# firstly, we need to make df/tibble out of ellipsis
X2 <- list(...)
n <- tibble(n = rep(0, times = length(y)))
index <- 0
for(Var in X2){
index <- index + 1
n[, paste0("Var_", index)] <- Var
}
# after the df was created, now it's time for calculating desc
# Using tidyr::gather with dplyr::summarize creates nice summary,
# where each row is another variable
descriptives <- tidyr::gather(n, key = "Variable", value = "Value") %>%
group_by(Variable) %>%
summarize(mean = mean(Value), var = var(Value), sd = sd(Value), .groups = "keep")
# everything except the output list is the same
intercept <- rep(1, length(y))
X <- cbind(intercept, X)
regression_coef <- solve(t(X) %*% X) %*% t(X) %*% y
predicted_val <- X %*% regression_coef
residual_val <- y - predicted_val
scatterplot <- plot(predicted_val, residual_val,
ylab = 'Residuals', xlab = 'Predicted values',
main = 'Predicted values against the residuals')
# draw the zero line in its own call instead of passing it to plot()
abline(0,0)
list('y' = y,
'X' = X,
'Regression coefficients' = regression_coef,
'Predicted values' = predicted_val,
'Residuals' = residual_val,
'Scatterplot' = scatterplot,
# drop the "n" placeholder row by name rather than by position,
# because the row order depends on how the groups are sorted
'descriptives' = descriptives[descriptives$Variable != "n", ]
)
}
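If you would rather avoid the dplyr/tidyr dependency, the same idea can be sketched in base R. This is only a sketch under assumptions: describe_dots is a hypothetical helper name (it is not part of the function above), and it assumes every argument passed through ... is a numeric vector. It also shows one way to pick up names when the caller supplies them, falling back to the Var_1, Var_2, ... placeholders otherwise.

```r
# Hypothetical helper (an assumption, not part of the original MLR function):
# summarise each vector passed through `...`.
describe_dots <- function(...) {
  vars <- list(...)                      # capture the ellipsis as a list

  # Use argument names where supplied, otherwise Var_1, Var_2, ...
  nms <- names(vars)
  if (is.null(nms)) nms <- rep("", length(vars))
  blank <- nms == ""
  nms[blank] <- paste0("Var_", seq_along(vars))[blank]

  # One row per variable, one column per statistic
  data.frame(
    Variable = nms,
    mean = vapply(vars, mean, numeric(1), USE.NAMES = FALSE),
    var  = vapply(vars, var,  numeric(1), USE.NAMES = FALSE),
    sd   = vapply(vars, sd,   numeric(1), USE.NAMES = FALSE)
  )
}

desc <- describe_dots(iris$Sepal.Width, Petal.Length = iris$Petal.Length)
print(desc)
```

Inside MLR you could then call describe_dots(...) and add the result to the output list. Note that mean(...) itself cannot do this, because mean() treats only its first argument as the data and the remaining ones as options such as trim.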
Upvotes: 0
Reputation: 39605
Try these slight changes to your function. I have applied it to some variables of the iris
dataset. You can compute the desired statistics over X
and then return them as an additional slot in your output. Here is the code:
#Function
MLR <- function(y_var, ...){
y <- y_var
X <- as.matrix(cbind(...))
RX <- X
intercept <- rep(1, length(y))
X <- cbind(intercept, X)
regression_coef <- solve(t(X) %*% X) %*% t(X) %*% y
predicted_val <- X %*% regression_coef
residual_val <- y - predicted_val
scatterplot <- plot(predicted_val, residual_val,
ylab = 'Residuals', xlab = 'Predicted values',
main = 'Predicted values against the residuals')
# draw the zero line in its own call instead of passing it to plot()
abline(0,0)
#Summary
#Stats
# one statistic per column of the predictor matrix (intercept excluded)
DMeans <- apply(RX,2,mean,na.rm=TRUE)
DSD <- apply(RX,2,sd,na.rm=TRUE)
DVar <- apply(RX,2,var,na.rm=TRUE)
DSummary <- rbind(DMeans,DSD,DVar)
#Out
list('y' = y,
'X' = X,
'Regression coefficients' = regression_coef,
'Predicted values' = predicted_val,
'Residuals' = residual_val,
'Scatterplot' = scatterplot,
'Summary' = DSummary
)
}
#Apply
MLR(y_var = iris$Sepal.Length,iris$Sepal.Width,iris$Petal.Length)
The final slots of the output will look like this:
$Scatterplot
NULL
$Summary
[,1] [,2]
DMeans 3.0573333 3.758000
DSD 0.4358663 1.765298
DVar 0.1899794 3.116278
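The columns are labelled [,1] and [,2] because cbind(...) has no column names to use for expressions such as iris$Sepal.Width. One possible workaround, sketched here on the assumption that you are willing to name the arguments in the call, is to pass them as name = value pairs, since cbind() turns argument names into column names. The snippet isolates just the summary step; summarise_X is a hypothetical name used for illustration, not part of the function above.

```r
# Sketch of the summary step only; `summarise_X` is a hypothetical name.
summarise_X <- function(...) {
  RX <- as.matrix(cbind(...))   # named arguments become column names
  rbind(DMeans = apply(RX, 2, mean, na.rm = TRUE),
        DSD    = apply(RX, 2, sd,   na.rm = TRUE),
        DVar   = apply(RX, 2, var,  na.rm = TRUE))
}

out <- summarise_X(Sepal.Width  = iris$Sepal.Width,
                   Petal.Length = iris$Petal.Length)
print(out)
```

The same naming should carry over to MLR itself if it is called as MLR(y_var = iris$Sepal.Length, Sepal.Width = iris$Sepal.Width, Petal.Length = iris$Petal.Length), because the function builds X with the same cbind(...) call.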
Upvotes: 1