another_newbie
another_newbie

Reputation: 139

How to parametrize piecewise regression coefficient to represent the slope for the following interval (instead of the change in the slope)

Consider the following dataset

Quantity <- c(25,39,45,57,70,85,89,100,110,124,137,150,177)
Sales <- c(1000,1250,2600,3000,3500,4500,5000,4700,4405,4000,3730,3400,3300)
df <- data.frame(Quantity,Sales)
df

Plotting the data, the distribution of observations is clearly non-linear, but presents a likely breaking-point around Quantity = 89 (I skip the plot here). Therefore, I built a joint piecewise linear model as follows

df$Xbar <- ifelse(df$Quantity>89,1,0)
df$diff <- df$Quantity - 89

reg <- lm(Sales ~ Quantity + I(Xbar * (Quantity - 89)), data = df)
summary(reg)

or simply

df$X <- df$diff*df$Xbar

reg <- lm(Sales ~ Quantity + X, data = df)
summary(reg)   

However, according to this parametrization, the coefficient of X represents the change in the slope from the preceding interval.

How can I parametrize the relevant coefficient to rather represent the slope for the second interval?

I did some research but I was unable to find the desired specification, apart from some automatization in stata (see the voice 'marginal' here https://www.stata.com/manuals13/rmkspline.pdf).

Any help is much appreciated. Thank you!

Acknowledgement: the workable example is retrieved from https://towardsdatascience.com/unraveling-spline-regression-in-r-937626bc3d96

Upvotes: 1

Views: 281

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 270055

The key here is to use a logical variable is.right which is TRUE for the points to the right of 89 and FALSE otherwise.

From the the output shown 60.88 is the slope to the left of 89 and -19.97 is the slope to the right. The lines intersect at Quantity = 89, Sales = 4817.30.

is.right <- df$Quantity > 89
fm <- lm(Sales ~ diff : is.right, df)

fm
## Call:
## lm(formula = Sales ~ diff:is.right, data = df)
##
## Coefficients:
##        (Intercept)  diff:is.rightFALSE   diff:is.rightTRUE  
##            4817.30               60.88              -19.97  

Alternatives

Alternately if you want to use Xbar from the question do it this way. It gives the same coefficients as fm.

fm2 <- lm(Sales ~ diff : factor(Xbar), df)

or

fm3 <- lm(Sales ~ I(Xbar * diff) + I((1 - Xbar) * diff), df)

Double check with nls

We can double check these using nls with the following formulation which makes use of the fact that if we extend both lines the one to use at any Quantity is the lower of the two.

st <- list(a = 0, b1 = 1, b2 = -1)
fm4 <- nls(Sales ~ a + pmin(b1 * (Quantity - 89), b2 * (Quantity - 89)), start = st)
fm4
## Nonlinear regression model
##   model: Sales ~ a + pmin(b1 * (Quantity - 89), b2 * (Quantity - 89))
##    data: parent.frame()
##       a      b1      b2 
## 4817.30   60.88  -19.97 
## residual sum-of-squares: 713120
##
## Number of iterations to convergence: 1 
## Achieved convergence tolerance: 2.285e-09

This would also work:

fm5 <- nls(Sales ~ a + ifelse(Quantity > 89, b2, b1) * diff, df, start = st)

Plot

Here is a plot:

plot(Sales ~ Quantity, df)
lines(fitted(fm) ~ Quantity, df)

screenshot

Model matrix

And here is the model matrix for the linear regression:

> model.matrix(fm)
   (Intercept) diff:is.rightFALSE diff:is.rightTRUE
1            1                -64                 0
2            1                -50                 0
3            1                -44                 0
4            1                -32                 0
5            1                -19                 0
6            1                 -4                 0
7            1                  0                 0
8            1                  0                11
9            1                  0                21
10           1                  0                35
11           1                  0                48
12           1                  0                61
13           1                  0                88

Upvotes: 2

StupidWolf
StupidWolf

Reputation: 46968

If you know the breakpoints, then you almost have the model, it should be:

fit=lm(Sales ~ Quantity + Xbar + Quantity:Xbar,data=df)

Because if you don't introduce a new intercept (Xbar), it will start from the intercept already in the model, which will not work. We can plot it:

plot(df$Quantity,df$Sales)
newdata = data.frame(Quantity=seq(40,200,by=5))
newdata$Xbar= ifelse(newdata$Quantity>89,1,0)
lines(newdata$Quantity,predict(fit,newdata))

enter image description here

The coefficients are:

summary(fit)

Call:
lm(formula = Sales ~ Quantity * Xbar, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-527.9 -132.2  -15.1  148.1  464.7 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -545.435    327.977  -1.663    0.131    
Quantity        59.572      5.746  10.367 2.65e-06 ***
Xbar          7227.288    585.933  12.335 6.09e-07 ***
Quantity:Xbar  -80.133      6.856 -11.688 9.64e-07 ***

And the coefficient of the 2nd slope is 59.572+(-80.133) = -20.561

Upvotes: 1

Related Questions