cfp

R fixest: Main variables of interest are fixed effects in a huge data set

I am interested in estimating a Poisson fixed effects model with:

\log{\mathbb{E}[y_{i,j,t}]}=\beta_{A(i,j,t)}+\alpha_{i,t}+\gamma_{i,j}

where A(i,j,t)\in\mathbb{N} is the "age" of the (i,j,t) observation.

I am interested in the \beta_{\cdot} coefficients, not the other fixed effects.

My first attempt at estimating this was as follows:

library(readr)
Data <- read_csv("FullData.csv", col_types = cols(UPC_PRICE = col_factor(), WEEK = col_factor(), MOVE = col_integer(), STORE_COM_CODE = col_factor(), AGE = col_factor()))
library(fixest)
Results = fepois(MOVE ~ AGE | STORE_COM_CODE^UPC_PRICE + STORE_COM_CODE^WEEK, Data, nthreads=28, verbose=1000)

But this results in fepois attempting to create a full matrix of dummies from the AGE variable, which is too large to fit in memory. (There are around 150 million observations, and AGE goes up to about 400.)
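A rough back-of-envelope check of why this runs out of memory. It is only a sketch: it assumes the dummies are held as a dense double-precision matrix (an assumption about the internals, not something confirmed here) and uses the approximate sizes quoted above.

```r
# Approximate sizes taken from the description above
n_obs    <- 150e6  # about 150 million observations
n_levels <- 400    # AGE goes up to about 400

# Assumption: one 8-byte double per cell of a dense n_obs x n_levels matrix
bytes_per_double <- 8
dummy_bytes <- n_obs * n_levels * bytes_per_double

# Size in decimal gigabytes
dummy_gb <- dummy_bytes / 1e9
print(dummy_gb)
```

Under those assumptions the dummy columns alone would need roughly 480 GB, before counting any working copies.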

As an alternative, I tried:

Results = fepois(MOVE ~ 1 | STORE_COM_CODE^UPC_PRICE + STORE_COM_CODE^WEEK + AGE, Data, nthreads=28, verbose=1000)
FE = fixef(Results)

With this approach, the fepois call completes successfully, but then it fails in the fixef call (to get the fixed effects, where the \beta_{\cdot} are now stored) with the message:

Problem getting FE, maximum iterations reached (1st order loop).
NOTE: The fixed-effects are not regular, they cannot be straightforwardly interpreted. The number of references is only approximate.

Of course I could increase the number of iterations, but the fact I'm getting this message suggests there's probably a better approach I don't know about. ("Regularity" is also an issue with this approach. It doesn't matter if the estimation drops certain columns from the \alpha_{\cdot} and \gamma_{\cdot} fixed effects, but I do not want it to drop any columns from the \beta_{\cdot} fixed effects.)

How should I be approaching this estimation?


Incidentally: despite setting nthreads, fepois still only uses one thread. Any ideas why? (Calling setFixest_nthreads(28) also seems to make no difference.)


Update 1: Setting iter=100000000 within the fixef call makes no difference. I still get the same error, suggesting it's a different iteration count that's being hit.

Update 2: Here are the first 10000 lines of the data set: https://gist.github.com/tholden/7cf0b4b8ae2b6030b60b704766903612 (*)

Update 3: getFixest_nthreads() returns 28, as expected (that's what I set it to, and it's also half the number of logical processors on my machine).


Answers (1)

Josh Allen

If I understand your problem correctly, you are getting something like this:

library(fixest)
library(readr)


examp_dat1 = read_csv('https://gist.githubusercontent.com/tholden/7cf0b4b8ae2b6030b60b704766903612/raw/d3b7a3810936344906f90b7d62b506ff42af0dd1/SampleData.csv', col_types = cols(UPC_PRICE = col_factor(), WEEK = col_factor(), MOVE = col_integer(), STORE_COM_CODE = col_factor(), AGE = col_factor())) 


mod = fepois(MOVE ~ AGE | STORE_COM_CODE^UPC_PRICE + STORE_COM_CODE^WEEK, data = examp_dat1)
#> NOTE: 9/0 fixed-effects (394 observations) removed because of only 0 outcomes.
#> The variable 'AGE224' has been removed because of collinearity (see $collin.var).
  
mod
#> Poisson estimation, Dep. Var.: MOVE
#> Observations: 9,605
#> Fixed-effects: STORE_COM_CODE^UPC_PRICE: 315,  STORE_COM_CODE^WEEK: 384
#> Standard-errors: Clustered (STORE_COM_CODE^UPC_PRICE) 
#>        Estimate Std. Error   z value Pr(>|z|) 
#> AGE3  -0.012467    11.6001 -0.001075  0.99914 
#> AGE4   0.049981    23.2149  0.002153  0.99828 
#> AGE5  -0.105345    34.8334 -0.003024  0.99759 
#> AGE6  -0.161140    46.4345 -0.003470  0.99723 
#> AGE7  -0.234467    58.0617 -0.004038  0.99678 
#> AGE8  -0.172549    69.6805 -0.002476  0.99802 
#> AGE9  -0.130779    81.2899 -0.001609  0.99872 
#> AGE10 -0.112788    92.8970 -0.001214  0.99903 
#> ... 324 coefficients remaining (display them with summary() or use argument n)
#> ... 1 variable was removed because of collinearity (AGE224)
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Log-Likelihood: -12,241.4   Adj. Pseudo R2: 0.249849
#>            BIC:  33,928.0     Squared Cor.: 0.551105

What is happening is that you are treating AGE as a factor when you import the data, so fepois estimates a coefficient for every level except the reference. If you are interested in the overall effect of age, then all you need to do is either coerce it to numeric or omit AGE = col_factor() when you import the data. Note that AGE then enters as a single slope rather than as a separate coefficient for each age.
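If you take the coercion route on a column that has already been read as a factor, one pitfall is worth a small sketch. The vector below is a hypothetical stand-in for the AGE column, not the real data: calling as.numeric() directly on a factor returns its internal level codes, so the usual idiom goes through as.character() first.

```r
# Toy factor standing in for the AGE column (hypothetical values)
age_factor <- factor(c("10", "3", "25", "3"))

# Pitfall: as.numeric() on a factor returns the internal level codes,
# not the ages the labels represent
age_codes <- as.numeric(age_factor)

# Converting to character first recovers the actual ages
age_numeric <- as.numeric(as.character(age_factor))

print(age_codes)
print(age_numeric)
```

Omitting the column type at import, as done next, avoids the issue altogether.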


examp_dat2 = read_csv('https://gist.githubusercontent.com/tholden/7cf0b4b8ae2b6030b60b704766903612/raw/d3b7a3810936344906f90b7d62b506ff42af0dd1/SampleData.csv', col_types = cols(UPC_PRICE = col_factor(), WEEK = col_factor(), MOVE = col_integer(), STORE_COM_CODE = col_factor())) 



mod2 = fepois(MOVE ~ AGE | STORE_COM_CODE^UPC_PRICE + STORE_COM_CODE^WEEK, data = examp_dat2)
#> NOTE: 9/0 fixed-effects (394 observations) removed because of only 0 outcomes.
  
mod2
#> Poisson estimation, Dep. Var.: MOVE
#> Observations: 9,605
#> Fixed-effects: STORE_COM_CODE^UPC_PRICE: 315,  STORE_COM_CODE^WEEK: 384
#> Standard-errors: Clustered (STORE_COM_CODE^UPC_PRICE) 
#>     Estimate Std. Error z value Pr(>|z|) 
#> AGE   1.3405    57551.2 2.3e-05  0.99998 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Log-Likelihood: -12,567.5   Adj. Pseudo R2: 0.250126
#>            BIC:  31,544.9     Squared Cor.: 0.504288

As for setFixest_nthreads(): if you want to throw all the available threads at the problem, you need to set setFixest_nthreads(nthreads = 0).
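A minimal sketch of how the thread setting can be changed and inspected, assuming fixest is installed. The variable names are illustrative only; the counts reported depend on the machine and on how the package was built.

```r
library(fixest)

# 0 requests every available thread
setFixest_nthreads(0)
all_threads <- getFixest_nthreads()

# A value strictly between 0 and 1 requests that fraction of the threads
setFixest_nthreads(0.5)
half_threads <- getFixest_nthreads()

# Compare the two settings as fixest resolved them
print(c(all = all_threads, half = half_threads))
```

Checking getFixest_nthreads() after each call shows what fixest actually resolved the request to, which may differ from the number asked for.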
