csarvf_01

Reputation: 61

Does Lasso require a full-rank regressor matrix to ensure consistent variable selection?

Assume we want to run a regression on some data. Say we have n = 48 individuals observed over T = 13 time periods, for a total of 624 observations. After taking first differences, we are left with 48 × 12 = 576 observations. Since we took first differences, individual-specific fixed effects have effectively been removed.

If I still want to include time-invariant variables, I can include at most 47 of them, because including more would make the control matrix rank-deficient.

In ordinary least squares (OLS), I understand why this is an issue: the matrix (X'X) is not invertible. Now, if I knew that only a few time-invariant variables were relevant, I could apply Lasso to select a subset, and I would end up with a full-rank regressor matrix.
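
To make the rank deficiency concrete, here is a minimal numpy sketch (the dimensions and the dependent column are hypothetical, chosen only to mirror the setup above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design: 576 differenced observations, 60 time-invariant
# controls, with one column an exact linear combination of two others.
n, p = 576, 60
X = rng.standard_normal((n, p))
X[:, -1] = X[:, 0] + X[:, 1]      # exact linear dependence

print(np.linalg.matrix_rank(X))   # 59 < 60: X is rank-deficient
print(np.linalg.cond(X.T @ X))    # astronomically large: X'X is not invertible
```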

My question is: I know that in the context of Lasso, the Restricted Eigenvalue (RE) condition must hold to ensure stable results, fast convergence rates, and consistent variable selection. Technically, this condition is violated if I don't drop columns beforehand, correct? The problem is that the choice of which columns to drop is arbitrary, and this choice can significantly influence the outcome, because the variables Lasso selects depend on which columns were dropped.
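
For reference, the version of the RE condition I have in mind (following Bickel, Ritov and Tsybakov, 2009; the notation is mine):

```latex
% RE(s, alpha): there exists kappa > 0 such that
\kappa(s, \alpha) \;=\;
\min_{\substack{S \subseteq \{1, \dots, p\} \\ |S| \le s}} \;
\min_{\substack{\delta \ne 0 \\ \|\delta_{S^c}\|_1 \le \alpha \|\delta_S\|_1}}
\frac{\|X\delta\|_2}{\sqrt{n}\, \|\delta_S\|_2} \;>\; 0.
```

Exact linear dependence gives a δ ≠ 0 with Xδ = 0, so the condition fails precisely when such a δ lies in the cone above, which is what I am unsure about.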

So I was wondering: can I feed Lasso the rank-deficient control matrix, and what issues would arise from this? I have always understood the RE condition to mean that highly pairwise-correlated variables pose a problem, because Lasso may arbitrarily drop one of them. But what does the RE condition imply when columns are exactly linearly dependent?
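
As a concrete experiment (a sketch with made-up data, assuming scikit-learn's coordinate-descent Lasso, which never inverts X'X):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)

n, p = 576, 60
X = rng.standard_normal((n, p))
X[:, -1] = X[:, 0] + X[:, 1]       # exact linear dependence, rank(X) = p - 1

# Hypothetical sparse truth: only the first three variables matter.
beta = np.zeros(p)
beta[:3] = [1.0, -1.0, 0.5]
y = X @ beta + 0.5 * rng.standard_normal(n)

# The Lasso objective always has a minimizer, even for rank-deficient X,
# so the solver runs without complaint -- but the minimizer need not be
# unique: columns 0, 1, and 59 can trade off against each other.
fit = Lasso(alpha=0.1).fit(X, y)
print(np.flatnonzero(fit.coef_))   # indices of the selected variables
```

The fit goes through, but whether the selected set is stable across solvers, column orderings, or values of alpha is exactly where the rank deficiency seems to bite.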

Another example would be dropping the reference category of a categorical variable to avoid perfect multicollinearity.
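
For instance (hypothetical column name), pandas drops the reference category at encoding time:

```python
import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "west", "south"]})

# drop_first=True drops the reference category ("north"), so the dummy
# columns plus an intercept are not perfectly collinear.
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(dummies)
```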

I have seen the approach of dropping columns to obtain a full-rank matrix before applying Lasso, and I wondered whether this is even necessary.

Upvotes: 0

Views: 16

Answers (0)
