SidC
SidC

Reputation: 3213

How to solve for highest value (regression peak) in multivariate polynomial regression using R?

I have a table in my SQL Server 2017 database containing the following data in part: enter image description here

My intent is to create a multivariate polynomial regression for each of the 19 columns where LikingOrder is my dependent variable and each of the 19 column values for a given RespID are the independent variables.

The end result should be the highest regression value for each column C1 through C19 per RespID. The final results should look something like this: enter image description here

I've read about polym and have attempted to use it in the following script:

ALTER PROCEDURE [dbo].[spRegressionPeak]   
@StudyID int
AS
BEGIN
Declare @sStudyID VARCHAR(50)
Set @sStudyID = CONVERT(VARCHAR(50),@StudyID)

--We use IsNull values to pass zeroes where an average wasn't calculated os 
that the polynomial regression can be calculated.
DECLARE @inquery  AS NVARCHAR(MAX) = '
    Select
c.StudyID, c.RespID, c.LikingOrder, avg(C1) as C1, avg(C2) as C2, avg(C3) as 
C3, avg(C4) as C4, avg(C5) as C5, avg(C6) as C6, avg(C7) as C7, avg(C8) as 
C8, avg(C9) as C9, avg(C10) as C10,
avg(C11) as C11, avg(C12) as C12, avg(C13) as C13, avg(C14) as C14, avg(C15) 
as C15, avg(C16) as C16, avg(C17) as C17, avg(isnull(C18,0)) as C18, avg(C19) 
as C19
from ClosedStudyResponses c
where c.StudyID = @StudyID
group by StudyID, RespID, LikingOrder
order by RespID 

--We are setting @inquery aka InputDataSet to be our initial dataset.  
--R Services requires that a data.frame be passed to any calculations being 
generated.  As such, df is simply data framing the @inquery data.
--The res object holds the polynomial regression results by RespondentID and 
LikingOrder for each of the averages in the @inquery resultset.
EXEC sp_execute_external_script @language = N'R'
, @script = N'
    studymeans <- InputDataSet

    df <- data.frame(studymeans) 

    res1 <- lm(df$LikingOrder ~ polym(df$c1, df$c2, df$c3, df$c4, df$c5, df$c6, df$c7, df$c8, df$c9, 
    df$c10, df$c11, df$c12, df$c13, df$c14, df$c15, df$c16, df$c17, df$c18, df$c19, degree = 1, raw = TRUE)) 
    res <- data.frame(res1)

'
, @input_data_1 = @inquery
, @output_data_1_name = N'res'
, @params = N'@StudyID int'
,@StudyID = @StudyID 
--- Edit this line to handle the output data frame.
WITH RESULT SETS ((RespID int, res varchar(max)));
END;

The above stored procedure gives the following error when provided a valid StudyID:

Error in model.frame.default(formula = df$LikingOrder ~ polym(df$c1, df$c2,  
: 
variable lengths differ (found for 'polym(df$c1, df$c2, df$c3, df$c4, df$c5, 
df$c6, df$c7, df$c8, df$c9, df$c10, df$c11, df$c12, df$c13, df$c14, df$c15, 
df$c16, df$c17, df$c18, df$c19, degree = 1, raw = TRUE)')
Calls: source ... lm -> eval -> eval -> <Anonymous> -> model.frame.default
In addition: There were 19 warnings (use warnings() to see them)

Is this the proper use of polym? If not, how do I achieve the goal of calculating 19 separate regressions? Finally, how do I programmatically determine the highest value for each regression?

Upvotes: 0

Views: 372

Answers (1)

Nilesh Ingle
Nilesh Ingle

Reputation: 1883

Based on the question and discussion in comments, assumptions made are:

  • RespID: is categorical parameter, and not used in model fit
  • StudyID: ignored in sample data
  • LinkingOrder: is the dependent variable i.e. response (not categorical)
  • C1 to C19: independent variables are numerical

  • Objective: To identify the regression coefficients of a linear fit to variables C1 through C19

  • Note: A polynomial fit is not added because the final requested table in question appears to have no iteraction terms listed.
  • Resource: Chapter 3, 5 in ISLR

Create sample dataframe

StudyID <- rep(10001, 100)
RespID <- c(rep(117,25), rep(119,25), rep(120,25), rep(121,25))
LinkingOrder <- floor(runif(100, 1, 9))
df <- data.frame(StudyID, RespID, LinkingOrder)
# Create columns C1 to C19
for (i in c(1:19)){
  vari <- paste("C", i, sep = "")
  df[vari] <-  floor(runif(100, 0, 9))
}

# Convert RespID to categorical variable
df$RespID <- as.factor(RespID)

enter image description here

Fit lm() and store coefficients in table format

Note: The Intercept term is included in the table

# Fit lm() and store coefficients in a table
final_table <- data.frame()
for (respid in unique(df$RespID)){
  data <- df[df['RespID']==respid, ]
  data <- subset(data, select = -c(StudyID, RespID))

  lm.fit <- lm(LinkingOrder ~ ., data=data)

  # Save to table
  final_table <- rbind(final_table, data.frame(t(unlist(lm.fit$coefficients))))
}

enter image description here

Upvotes: 1

Related Questions