Reputation: 43
Good afternoon everyone!
I have been working on this regression for a some time now and have built a model I am somewhat satisfied of.
Here is the code and graph result for it. My constraints are that I must set the starting point at the coordinates (maturity = 10 and yield_change = 0) and that the breakpoints must be at maturities 15, 20 and 30. (I want the model to start at maturity 10 and finish at maturity 50).
To address the first constraint I removed 10 from my maturity vector and set the model to pass through the origin.
The bond_data Dataframe is composed of two vectors; the first column "yield_change" and the second column "maturity" have the following values:
yield_change <- c(-1.2 -0.9 -1.8 -1.4 -1.8 -2.1 -2.3 -2.1 -2.5 -2.2 -2.4 -2.5 -2.4 -2.4 -3.0 -2.6 -5.1 -4.8 -4.9 -5.0 -5.0 -6.2 -6.1 -6.3 -5.0 -5.0)
maturity <- c(10.27945 10.86027 11.77534 12.35616 12.52055 13.35890 13.86301 14.28219 14.35890 15.35890 15.86301 16.77808 17.36164 17.86575 18.36164 21.86849 22.52877 23.86849 24.36438 25.36712 26.87123 27.87123 28.87123 29.87397 44.37808 49.38356)
This is the maturity vector before I remove 10 from each value.
Here is the regression program and graph.
library(segmented)
library("readxl")
library(ggplot2)
#DATA PRE-PROCESSING
bond_data <- read_excel("Book2.xlsx")
bond_data <- bond_data[-1,-c(2,3)]
colnames(bond_data) <- c("yield_change","maturity")
bond_data["maturity"] <- as.numeric(bond_data[["maturity"]])
#FITTING TEN YEAR AT ZERO
bond_data["maturity"] <- bond_data$maturity - 10
model_sub <- lm(yield_change~maturity+0, data = bond_data)
segmented.model <- segmented(model_sub,seg.Z=~ maturity,
psi = list(maturity = c(5,10,20)),fixed.psi = c(5,10,20),
control = seg.control(it.max = 0, n.boot = 50))
summary(segmented.model)
o <- segmented.model
xp <- c(0,o$psi[,"Est."], 40)
new_data <- data.frame(xp)
colnames(new_data) <- "maturity"
new_data$dummy1 <- pmax(new_data$maturity - o$psi[1,2], 0)
new_data$dummy2 <- pmax(new_data$maturity - o$psi[2,2], 0)
new_data$dummy3 <- pmax(new_data$maturity - o$psi[3,2], 0)
new_data$dummy4 <-I(new_data$maturity > o$psi[1,2]) * coef(o)[2]
new_data$dummy5 <-I(new_data$maturity > o$psi[2,2]) * coef(o)[3]
new_data$dummy6 <-I(new_data$maturity > o$psi[3,2]) * coef(o)[4]
names(new_data)[-1] <- names(model.frame(o))[-c(1,2)]
yp <- predict(o,new_data)
plot(bond_data$maturity+10,bond_data$yield_change, pch=16, col="blue",ylim = c(-8,0),
xlab = "maturity",ylab = "yield_change")
lines(xp+10,yp)
#BREAKPOINT VALUES
break_maturities <- c(0,5,10,20,40)
maturities_df <- data.frame(break_maturities)
colnames(maturities_df) <- "break_maturity"
maturities_df$dummy1 <- pmax(maturities_df$break_maturity - o$psi[1,2], 0)
maturities_df$dummy2 <- pmax(maturities_df$break_maturity - o$psi[2,2], 0)
maturities_df$dummy3 <- pmax(maturities_df$break_maturity - o$psi[3,2], 0)
maturities_df$dummy4 <-I(maturities_df$break_maturity > o$psi[1,2]) * coef(o)[2]
maturities_df$dummy5 <-I(maturities_df$break_maturity > o$psi[2,2]) * coef(o)[3]
maturities_df$dummy6 <-I(maturities_df$break_maturity > o$psi[3,2]) * coef(o)[4]
names(maturities_df)[-1] <- names(model.frame(o))[-c(1,2)]
names(maturities_df)[1] <- "maturity"
fit <- predict(o,maturities_df)
points(break_maturities+10,fit, pch=18, col = "black", cex = 1.5)
break_yields <- data.frame(break_maturities = break_maturities+10,
yield_preds = fit)
breakpoint_yield_predictions <- break_yields #return 2
I also overlayed the plot of the exact same regression but without setting the first point to (10,0) to illustrate the problem I am trying to solve with weights.
(Here is the code for it)
#FITTING MODEL WITHOUT SETTING TEN YEAR AT ZERO
model_no_origin <- lm(yield_change~maturity, data = bond_data)
seg_no_origin <- segmented(model_no_origin,seg.Z=~ maturity,
psi = list(maturity = c(5,10,20)),fixed.psi = c(5,10,20),
control = seg.control(it.max = 0, n.boot = 50))
mno <- seg_no_origin
xp <- c(0,mno$psi[,"Est."], 40)
new_data <- data.frame(xp)
colnames(new_data) <- "maturity"
new_data$dummy1 <- pmax(new_data$maturity - mno$psi[1,2], 0)
new_data$dummy2 <- pmax(new_data$maturity - mno$psi[2,2], 0)
new_data$dummy3 <- pmax(new_data$maturity - mno$psi[3,2], 0)
new_data$dummy4 <-I(new_data$maturity > mno$psi[1,2]) * coef(mno)[2]
new_data$dummy5 <-I(new_data$maturity > mno$psi[2,2]) * coef(mno)[3]
new_data$dummy6 <-I(new_data$maturity > mno$psi[3,2]) * coef(mno)[4]
names(new_data)[-1] <- names(model.frame(mno))[-c(1,2)]
yp <- predict(mno,new_data)
lines(xp+10,yp,col = "red")
And we obtain the following graph (red line representing regression without stating point at (maturity = 10 and yield_change = 0) but with fixed breakpoint maturity values (15,20,30)
I am not that interested in applying weights to the observations above the 20 year maturity as we can see that the difference is minimal between the two models but mainly on all the observations before the 20 year maturity (the area of focus being that first segment I think so but unsure...)
Looking at the graph, I want the first breakpoint to go up in y value so that the values around it have smaller residuals, this is because the main application of my model is to predict the yield_change values at each breakpoint and we can see that there is a big difference in that value between the black and red line. The red line being better fitted to the data due to no initial point constraint has better fit value...
So I have tried to reduce the weight of those initial observations (I think the critical segment is the one between the 10-15 maturity range) that have much larger residuals than the rest of the model.
I have used the following formula with the model variable being the segmented.model I derive without using any weights.
weight <- 1 / lm(abs(model$residuals) ~ model$fitted.values)$fitted.values^2
Documentation on how to apply weights to a segmented model (using R segmented package is very scarce). I assumed that applying a lm to all the segmented model's residuals and fitted values would be wrong and so I segmented it by fixed interval as I know of all these ([10-15],[15-20],[20-30],[30-50]). Is that okay to do? Or should I stick with lm... I do not even know if this formula is the one I should be applying...
Here is how I implemented this logic for each regression segment...
library(segmented)
library("readxl")
library(ggplot2)
#DATA PRE-PROCESSING
bond_data <- read_excel("Book2.xlsx")
bond_data <- bond_data[-1,-c(2,3)]
colnames(bond_data) <- c("yield_change","maturity")
bond_data["maturity"] <- as.numeric(bond_data[["maturity"]])
#SEGMENTED MODEL FITING
#FITTING TEN YEAR AT ZERO
bond_data["maturity"] <- bond_data$maturity - 10
model_sub <- lm(yield_change~maturity+0, data = bond_data)
segmented.model <- segmented(model_sub,seg.Z=~ maturity,
psi = list(maturity = c(5,10,20)),fixed.psi = c(5,10,20),
control = seg.control(it.max = 0, n.boot = 50))
m <- segmented.model
summary(segmented.model)
o <- segmented.model
#10 TO 15 WEIGHTS
residuals_10 <- o$residuals[bond_data$maturity <= 5]
fitted_10 <- o$fitted.values[bond_data$maturity <= 5]
data_10 <- data.frame(residuals = abs(residuals_10),
fitted = fitted_10)
model_1 <- lm(residuals ~ fitted, data = data_10)
weight_1 <- 1 / model_1$fitted.values^2
#15 TO 20 WEIGHTS
residuals_15 <- o$residuals[bond_data$maturity > 5 & bond_data$maturity <= 10]
fitted_15 <- o$fitted.values[bond_data$maturity > 5 & bond_data$maturity <= 10]
data_15 <- data.frame(residuals = abs(residuals_15),
fitted = fitted_15)
model_2 <- lm(residuals ~ fitted, data = data_15)
weight_2 <- 1 / model_2$fitted.values^2
#20 TO 30 WEIGHTS
residuals_20 <- o$residuals[bond_data$maturity > 10 & bond_data$maturity <= 20]
fitted_20 <- o$fitted.values[bond_data$maturity > 10 & bond_data$maturity <= 20]
data_20 <- data.frame(residuals = abs(residuals_20),
fitted = fitted_20)
model_3 <- lm(residuals ~ fitted, data = data_20)
weight_3 <- 1 / model_3$fitted.values^2
#30 TO 50 WEIGHTS
residuals_30 <- o$residuals[bond_data$maturity > 20 & bond_data$maturity <= 40]
fitted_30 <- o$fitted.values[bond_data$maturity > 20 & bond_data$maturity <= 40]
data_30 <- data.frame(residuals = abs(residuals_30),
fitted = fitted_30)
model_4 <- lm(residuals ~ fitted, data = data_30)
weight_4 <- 1 / model_4$fitted.values^2
#Combined weight vector
weight <- c(weight_1,weight_2,weight_3,weight_4)
#WEIGHTED MODEL
weighted_lm_model <- lm(yield_change ~ maturity+0, data = bond_data, weights = weight)
piecewise_model <- segmented(weighted_lm_model,seg.Z=~ maturity,
psi = list(maturity = c(5,10,20)),fixed.psi = c(5,10,20),
control = seg.control(it.max = 0, n.boot = 50))
o <- piecewise_model
summary <- summary(o) #return 1
xp <- c(0,o$psi[,"Est."], 40)
new_data <- data.frame(xp)
colnames(new_data) <- "maturity"
RMSE <- sqrt(mean(o$residuals^2))
RMSE <- format(round(RMSE,3), nsmall = 3)
new_data$dummy1 <- pmax(new_data$maturity - o$psi[1,2], 0)
new_data$dummy2 <- pmax(new_data$maturity - o$psi[2,2], 0)
new_data$dummy3 <- pmax(new_data$maturity - o$psi[3,2], 0)
new_data$dummy4 <-I(new_data$maturity > o$psi[1,2]) * coef(o)[2]
new_data$dummy5 <-I(new_data$maturity > o$psi[2,2]) * coef(o)[3]
new_data$dummy6 <-I(new_data$maturity > o$psi[3,2]) * coef(o)[4]
names(new_data)[-1] <- names(model.frame(o))[-c(1,2,3)]
yp <- predict(o,new_data)
plot(bond_data$maturity+10,bond_data$yield_change, pch=16, col="blue",ylim = c(-8,0),
xlab = "maturity",ylab = "yield_change")
text(35,-2.5,paste("RMSE =",RMSE,sep = " "))
lines(xp+10,yp)
#BREAKPOINT VALUES
break_maturities <- c(0,5,10,20,40)
maturities_df <- data.frame(break_maturities)
colnames(maturities_df) <- "break_maturity"
maturities_df$dummy1 <- pmax(maturities_df$break_maturity - o$psi[1,2], 0)
maturities_df$dummy2 <- pmax(maturities_df$break_maturity - o$psi[2,2], 0)
maturities_df$dummy3 <- pmax(maturities_df$break_maturity - o$psi[3,2], 0)
maturities_df$dummy4 <-I(maturities_df$break_maturity > o$psi[1,2]) * coef(o)[2]
maturities_df$dummy5 <-I(maturities_df$break_maturity > o$psi[2,2]) * coef(o)[3]
maturities_df$dummy6 <-I(maturities_df$break_maturity > o$psi[3,2]) * coef(o)[4]
names(maturities_df)[-1] <- names(model.frame(o))[-c(1,2,3)]
names(maturities_df)[1] <- "maturity"
fit <- predict(o,maturities_df)
points(break_maturities+10,fit, pch=18, col = "black", cex = 1.5)
break_yields <- data.frame(break_maturities = break_maturities+10,
yield_preds = fit)
breakpoint_yield_predictions <- break_yields #return 2
Giving the graph:
This is where my second issue comes in, the 15 year breakpoint actually gets lower... This is because (I think) the weights calculated give a much higher score to those observations close to the 15 maturity breakpoint as they have very little residuals...
If I want to apply some weights to only some observations like the first three for example, to what weights do I set the others, 0, NA or 1? Not adding any weights to the values over the 20 year maturity has made my model behave very strangely...
What I have done as well is create a conditional construct where if a abs(residual) of the first segmented model (being part of that first segment/line) is above a certain value that I choose empirically (something that is not ideal), well I apply a lower value to that observation than to the others. (But the input values are chosen randomly...)
Essentially I think that I am taking the wrong approach and having done research online has led me to find almost nothing about this... I was also thinking to reduce the weight of those initial variables as well as maybe putting in a threshold of max weight the values close to the 15 year breakpoint can have.
To summarise my main goal is to address the change that the initial point constraint (maturity = 10 and yield_change = 0) has created in my model before the 20 year breakpoint.The main goal for this is to have accurate breakpoint predictions (like the one of the red line) while at the same time still having a somewhat accurate segmented regression line without overfitting my model...
This may be a bit long winded but I am very thankful if you took the time to read through it! Any help would be very much appreciated and I hope you have a great rest of the day!
Upvotes: 0
Views: 233
Reputation: 459
Good morning,
I do not think you need to use weights here. The challenge I think you are having is that if you estimate a model without an intercept you will typically not get as good a fit as if you estimate it with an intercept (typically one estimates a model without an intercept after one has centered the dependent variable around zero by subtracting the mean). Rather than using the segmented package, I would just create three variables for each "knot point" as (maturity-15)*(1 if maturity > 15 and 0 otherwise) and so forth. I think this is what you want:
maturity <- c(
10.27945, 10.86027, 11.77534, 12.35616, 12.52055, 13.35890, 13.86301,
14.28219, 14.35890, 15.35890, 15.86301, 16.77808, 17.36164, 17.86575,
18.36164, 21.86849, 22.52877, 23.86849, 24.36438, 25.36712, 26.87123,
27.87123, 28.87123, 29.87397, 44.37808, 49.38356
)
maturity15 <- (maturity - 15) * (maturity > 15)
maturity20 <- (maturity - 20) * (maturity > 20)
maturity30 <- (maturity - 30) * (maturity > 30)
plot(maturity, yield_change, col = "darkblue", ylim = c(-8, 0))
lm1 <- lm(yield_change ~ maturity + maturity15 + maturity20 + maturity30)
pred1 <- cbind(maturity, predict(lm1))
pred1 <- pred1[order(pred1[, 1]), ]
lines(pred1, col = "darkred",lwd=4)
weights0 = rep(1,length(maturity))
weights0 = 30*(maturity<12)
table(weights0)
lm2 <- lm(yield_change ~ maturity + maturity15 + maturity20 + maturity30,weights=weights0)
pred2 <- cbind(maturity, predict(lm2))
pred2 <- pred2[order(pred1[, 1]), ]
lines(pred1, col = "green")
which returns the plot below. See that the weighted and unweighted model are basically the same.
The way I think about weighted regression is that if you set the weight of one observation to 10 (and leave the others at 1) you are treating it as if it were 10 observations. You should get the same coefficients if you estimated an unweighted regression but repeated the observation 10 times. From the help("lm"), it states '(that is, minimizing sum(w*e^2))' so if set w=10 and all the others are 1, you have the same objective as if you just repeated the observation 10 times. Hope this is helpful.
Upvotes: 0