stackoverflowuser2010

Reputation: 40969

R linear regression with lm - how to deal with categorical variables with thousands of values (like city or zip code)?

I am using R and the linear regression function lm() to build a prediction model for business sales of retail stores. Among the many independent (predictor) feature variables in my dataset, there are some categorical (factor) features that can take on thousands of different values, such as zip code (and/or city name). For example, there are over 6000 different zip codes for California alone; if I use city instead, there are over 400 cities.

I understand that lm() creates a dummy (indicator) variable for each level of a categorical feature. The problem is that when I run lm(), this explosion of variables uses a lot of memory and takes a really long time. How can I avoid or handle this situation with my categorical variables?
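For illustration, here is a minimal sketch with made-up data of the expansion I mean; model.matrix() is what lm() builds internally, with one indicator column per factor level:

    # Toy data: a handful of zip codes standing in for the thousands in my real data
    set.seed(1)
    df <- data.frame(
      sales = rnorm(10),
      zip   = factor(sample(c("90001", "90002", "90003", "90004"), 10, replace = TRUE))
    )

    # lm() builds this design matrix internally: one dummy column per zip level
    # (minus the reference level), plus the intercept.
    X <- model.matrix(sales ~ zip, data = df)
    dim(X)    # 10 rows, 4 columns here -- but ~6000 columns with real zip codes
    head(X)

With ~6000 zip levels and many rows, that dense matrix is what eats the memory and time.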

Upvotes: 1

Views: 1472

Answers (1)

Jacob H

Reputation: 4513

Your intuition to move from zip codes to cities is good. However, the question is: is there a further level of spatial aggregation that will still capture the important spatial variation but will result in the creation of fewer categorical (i.e. dummy) variables? Probably. Depending on your question, simply including a dummy for rural/suburban/urban may be all you need.
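For example, here is a minimal sketch of collapsing a high-cardinality city factor into a coarse density class; the lookup table is hypothetical and you would build it from your own knowledge of the cities or from census/planning data:

    # Hypothetical mapping from city to a coarse density class
    lookup <- data.frame(
      city   = c("Los Angeles", "San Francisco", "Fresno", "Susanville"),
      region = c("urban", "urban", "suburban", "rural")
    )

    df <- merge(df, lookup, by = "city", all.x = TRUE)
    df$region <- factor(df$region)

    # Only a couple of dummies instead of hundreds of city dummies
    fit <- lm(sales ~ region, data = df)   # plus your other predictors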

In your case, geographic region is likely a proxy meant to capture variation in socio-economic data. If so, why not include the socio-economic data directly? To do this you could use your city/zip data to link to US Census data.
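A minimal sketch of that idea, assuming a zip-level table of census variables (the column names and values here are made up; in practice you could pull them from the Census API, e.g. via the tidycensus package, or from a downloaded file):

    # Hypothetical zip-level socio-economic data
    census <- data.frame(
      zip           = c("90001", "90002", "90003"),
      median_income = c(41000, 38000, 45000),
      pop_density   = c(9500, 8700, 10200)
    )

    df <- merge(df, census, by = "zip", all.x = TRUE)

    # Replace the zip factor with the continuous covariates it was proxying for
    fit <- lm(sales ~ median_income + pop_density, data = df)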

However, if you really need/want to include cities, try estimating a fixed-effects model. The resulting within estimator differences out time-invariant group effects such as your city effects.
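I won't prescribe a package, but one common way to absorb a large factor without ever materializing its dummy columns is the fixest package (my choice here, not the only one; lfe::felm is similar), or you can apply the within transformation by hand with ave():

    # Option 1: absorb city as a fixed effect (dummies are never built explicitly)
    # install.packages("fixest")
    library(fixest)
    fit_fe <- feols(sales ~ price + promo | city, data = df)  # price/promo are hypothetical predictors

    # Option 2: within transformation by hand -- demean each variable inside its
    # city, then run lm() on the demeaned data (no city dummies needed)
    df$sales_dm <- df$sales - ave(df$sales, df$city)
    df$price_dm <- df$price - ave(df$price, df$city)
    fit_within <- lm(sales_dm ~ price_dm + 0, data = df)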

Even if you find a way to obtain an OLS estimate with 400 cities in R, I would strongly encourage you not to use an OLS estimator; use a Ridge or Lasso estimator instead. Unless your data is massive (it can't be too big, since you're using R), the inclusion of so many dummy variables is going to dramatically reduce the degrees of freedom, which can lead to over-fitting and generally poorly estimated coefficients and standard errors.

In slightly more sophisticated language, when degrees of freedom are low, the minimization problem you solve when you estimate OLS is "ill-posed", and consequently you should use regularization. For example, Ridge regression (i.e. Tikhonov regularization) would be a good solution. Remember, however, that Ridge regression is a biased estimator and therefore you should perform bias correction.
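Here is a minimal sketch of a Ridge fit with the glmnet package (my choice of package; nothing in particular is required). A sparse model matrix keeps the thousands of dummy columns cheap to store, and cross-validation chooses the penalty:

    # install.packages(c("glmnet", "Matrix"))
    library(glmnet)
    library(Matrix)

    # Sparse design matrix: the city/zip dummies are mostly zeros, so this stays small
    X <- sparse.model.matrix(sales ~ . - 1, data = df)
    y <- df$sales

    # alpha = 0 gives Ridge (Tikhonov regularization); alpha = 1 would give the Lasso
    cv_fit <- cv.glmnet(X, y, alpha = 0)

    coef(cv_fit, s = "lambda.min")                      # coefficients at the CV-chosen penalty
    pred <- predict(cv_fit, newx = X, s = "lambda.min") # fitted values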

My solutions in order of my preference:

  1. Aggregate up to a coarser spatial area (i.e. maybe regions instead of cities).
  2. Fixed effect estimator.
  3. Ridge regression.

If you don't like my suggestions, I would suggest you post this question on Cross Validated. IMO your question is closer to a statistics question than a programming question.

Upvotes: 3
