luchonacho

Reputation: 7177

Manual calculation of WLS does not match output of lm() in R

Weighted Least Squares can be computed in R with lm() and its weights argument. To understand the formula, I also computed the estimates manually, but the results do not match:

# Sample from model
set.seed(123456789) 
x1 <- 10 * rbeta(500, 2, 6) 
x2 <- x1 + 2*rchisq(500, 3, 2) + rnorm(500)*sd(x1)
u <- rnorm(500) 
y <- 0.5 + 1.2*x1 - 0.7*x2 + u

# Produce weights (shouldn't matter how)
w <- x1/sd(x1)
# Manual WLS
y_WLS <- y*sqrt(w)
x1_WLS <- x1*sqrt(w)
x2_WLS <- x2*sqrt(w)
summary(lm(y_WLS ~ x1_WLS + x2_WLS))
# Automatic WLS
summary(lm(y ~ x1+x2, weights=w))

The two commands give different results. They should be exactly the same. I am just following the instructions from Wooldridge (2019), section 8.4a, the relevant bits of which I have combined in the image below:

[Image: excerpts from Wooldridge (2019), section 8.4a, on the WLS transformation]

As you can read there, WLS with weight w is equivalent to running OLS on a transformed model in which each variable has been multiplied by the square root of w, which is what I do above. Why the difference, then?
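For reference, the weighted fit can also be checked against the closed-form WLS estimator b = (X'WX)^{-1} X'Wy with W = diag(w). This matrix-algebra check is my own sketch, not something from Wooldridge's section, but it reproduces what lm() computes with the weights argument:

```r
# Sketch: WLS via the closed-form estimator (X'WX)^{-1} X'Wy,
# using the same simulated data as above.
set.seed(123456789)
x1 <- 10 * rbeta(500, 2, 6)
x2 <- x1 + 2*rchisq(500, 3, 2) + rnorm(500)*sd(x1)
u  <- rnorm(500)
y  <- 0.5 + 1.2*x1 - 0.7*x2 + u
w  <- x1/sd(x1)

X <- cbind(1, x1, x2)                        # design matrix with intercept
b <- solve(t(X) %*% (w * X), t(X) %*% (w * y))   # w * X scales each row by its weight
b_lm <- coef(lm(y ~ x1 + x2, weights = w))
all.equal(as.numeric(b), as.numeric(b_lm))   # TRUE
```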

Upvotes: 0

Views: 486

Answers (2)

luchonacho

Reputation: 7177

I realised the problem (thanks to this post). The transformation actually removes the constant and adds a new variable: the square root of the weight. Thus, if instead of

summary(lm(y_WLS ~ x1_WLS + x2_WLS))

I use

summary(lm(y_WLS ~ 0 + sqrt(w) + x1_WLS + x2_WLS))

(removing the constant and adding sqrt(w) as a regressor), the two are exactly the same. This is clearly implied in Wooldridge; I wasn't reading carefully enough.
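Here is a self-contained check, reusing the simulation from the question, that the corrected transformed regression reproduces the lm(..., weights = w) coefficients exactly (up to numerical precision):

```r
set.seed(123456789)
x1 <- 10 * rbeta(500, 2, 6)
x2 <- x1 + 2*rchisq(500, 3, 2) + rnorm(500)*sd(x1)
u  <- rnorm(500)
y  <- 0.5 + 1.2*x1 - 0.7*x2 + u
w  <- x1/sd(x1)

# Transformed model: sqrt(w) plays the role of the intercept,
# and the ordinary constant is suppressed with "0 +".
y_WLS  <- y*sqrt(w)
x1_WLS <- x1*sqrt(w)
x2_WLS <- x2*sqrt(w)
fit_manual <- lm(y_WLS ~ 0 + sqrt(w) + x1_WLS + x2_WLS)
fit_auto   <- lm(y ~ x1 + x2, weights = w)

all.equal(unname(coef(fit_manual)), unname(coef(fit_auto)))  # TRUE
```

The coefficient names differ between the two fits ("sqrt(w)" versus "(Intercept)", and so on), which is why unname() is needed before comparing.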

Upvotes: 1

ssaha

Reputation: 499

To my limited knowledge, the weights need not be multiplied into the values; they can instead be conceptualized as the frequency of each observation. In other words, each row should be replicated as many times as its weight. For example, let's round your w to the nearest integer, put everything into a data.frame, and replicate each row that many times.

# create a data frame with an id and rounded weights
data <- data.frame(id = 1:length(y), y = y, x1 = x1, x2 = x2, w = w, weight = round(w))

# now replicate the rows as the value of weights
data.expanded <- data[rep(row.names(data), data$weight), ]

# let's fit the manual WLS model
summary(lm(y ~ x1 + x2, data = data.expanded))

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  0.511824   0.096365    5.311  1.4e-07 ***
x1           1.182342   0.024249   48.758  < 2e-16 ***
x2          -0.693060   0.004643 -149.269  < 2e-16 ***

# also fit the automated WLS
summary(lm(y ~ x1+x2, weights=w))

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  0.519752   0.125816    4.131 4.24e-05 ***
x1           1.181847   0.031544   37.466  < 2e-16 ***
x2          -0.693566   0.006047 -114.694  < 2e-16 ***

As you can see, we get similar results from both approaches. The slight discrepancy is due to the rounding of the weights; for a large dataset it will be almost negligible.

Note that id is created only to keep track of replications in the expanded data, for no other reason.
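A small made-up example (my own data, not the question's) shows that when the weights are exactly integers, the replication approach matches lm(..., weights = ...) perfectly in the point estimates; the standard errors still differ, because the expanded data pretends there are sum(w) independent observations:

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2*x + rnorm(n)
w <- sample(1:4, n, replace = TRUE)      # integer frequency weights

fit_w <- lm(y ~ x, weights = w)
expanded <- data.frame(x = x, y = y)[rep(1:n, w), ]  # replicate each row w times
fit_rep <- lm(y ~ x, data = expanded)

all.equal(coef(fit_w), coef(fit_rep))    # TRUE: identical point estimates
```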

Upvotes: 0
