Reputation: 3213
I have a table in my SQL Server 2017 database containing the following data in part:
My intent is to create a multivariate polynomial regression for each of the 19 columns where LikingOrder is my dependent variable and each of the 19 column values for a given RespID are the independent variables.
The end result should be the highest regression value for each column C1 through C19 per RespID. The final results should look something like this:
I've read about polym and have attempted to use it in the following script:
ALTER PROCEDURE [dbo].[spRegressionPeak]
@StudyID int
AS
BEGIN
Declare @sStudyID VARCHAR(50)
Set @sStudyID = CONVERT(VARCHAR(50),@StudyID)
--We use IsNull values to pass zeroes where an average wasn't calculated os
that the polynomial regression can be calculated.
DECLARE @inquery AS NVARCHAR(MAX) = '
Select
c.StudyID, c.RespID, c.LikingOrder, avg(C1) as C1, avg(C2) as C2, avg(C3) as
C3, avg(C4) as C4, avg(C5) as C5, avg(C6) as C6, avg(C7) as C7, avg(C8) as
C8, avg(C9) as C9, avg(C10) as C10,
avg(C11) as C11, avg(C12) as C12, avg(C13) as C13, avg(C14) as C14, avg(C15)
as C15, avg(C16) as C16, avg(C17) as C17, avg(isnull(C18,0)) as C18, avg(C19)
as C19
from ClosedStudyResponses c
where c.StudyID = @StudyID
group by StudyID, RespID, LikingOrder
order by RespID
--We are setting @inquery aka InputDataSet to be our initial dataset.
--R Services requires that a data.frame be passed to any calculations being
generated. As such, df is simply data framing the @inquery data.
--The res object holds the polynomial regression results by RespondentID and
LikingOrder for each of the averages in the @inquery resultset.
EXEC sp_execute_external_script @language = N'R'
, @script = N'
studymeans <- InputDataSet
df <- data.frame(studymeans)
res1 <- lm(df$LikingOrder ~ polym(df$c1, df$c2, df$c3, df$c4, df$c5, df$c6, df$c7, df$c8, df$c9,
df$c10, df$c11, df$c12, df$c13, df$c14, df$c15, df$c16, df$c17, df$c18, df$c19, degree = 1, raw = TRUE))
res <- data.frame(res1)
'
, @input_data_1 = @inquery
, @output_data_1_name = N'res'
, @params = N'@StudyID int'
,@StudyID = @StudyID
--- Edit this line to handle the output data frame.
WITH RESULT SETS ((RespID int, res varchar(max)));
END;
The above stored procedure gives the following error when provided a valid StudyID:
Error in model.frame.default(formula = df$LikingOrder ~ polym(df$c1, df$c2,
:
variable lengths differ (found for 'polym(df$c1, df$c2, df$c3, df$c4, df$c5,
df$c6, df$c7, df$c8, df$c9, df$c10, df$c11, df$c12, df$c13, df$c14, df$c15,
df$c16, df$c17, df$c18, df$c19, degree = 1, raw = TRUE)')
Calls: source ... lm -> eval -> eval -> <Anonymous> -> model.frame.default
In addition: There were 19 warnings (use warnings() to see them)
Is this the proper use of polym? If not, how do I achieve the goal of calculating 19 separate regressions? Finally, how do I programmatically determine the highest value for each regression?
Upvotes: 0
Views: 372
Reputation: 1883
Based on the question and discussion in comments, assumptions
made are:
RespID
: is categorical parameter
, and not used in model fitStudyID
: ignored in sample data LinkingOrder
: is the dependent variable i.e. response
(not categorical) C1 to C19
: independent variables
are numerical
Objective
: To identify the regression coefficients of a linear fit
to variables C1
through C19
Note
: A polynomial fit
is not added because the final requested table in question appears to have no iteraction terms listed.Resource
: Chapter 3, 5 in ISLRCreate sample dataframe
StudyID <- rep(10001, 100)
RespID <- c(rep(117,25), rep(119,25), rep(120,25), rep(121,25))
LinkingOrder <- floor(runif(100, 1, 9))
df <- data.frame(StudyID, RespID, LinkingOrder)
# Create columns C1 to C19
for (i in c(1:19)){
vari <- paste("C", i, sep = "")
df[vari] <- floor(runif(100, 0, 9))
}
# Convert RespID to categorical variable
df$RespID <- as.factor(RespID)
Fit lm() and store coefficients in table format
Note: The Intercept term is included in the table
# Fit lm() and store coefficients in a table
final_table <- data.frame()
for (respid in unique(df$RespID)){
data <- df[df['RespID']==respid, ]
data <- subset(data, select = -c(StudyID, RespID))
lm.fit <- lm(LinkingOrder ~ ., data=data)
# Save to table
final_table <- rbind(final_table, data.frame(t(unlist(lm.fit$coefficients))))
}
Upvotes: 1