Ore M

Reputation: 257

lm in R: Workaround for 'contrasts' error

I'm creating a linear model from a very large amount of data (50 million lines) using the biglm package. This is done by first fitting a linear model to one chunk of data, and then updating the model by reading in more chunks (1 million lines each) and calling the 'update' function from 'biglm'. My model uses year (a factor with 20 levels), temperature, and a factor variable called is_paid that is either 1 or 0. The code looks something like this:

model = biglm(output~year:is_paid+temp,data = df) #creates my original model from a starting data frame, df
newdata = file[i] #This is just an example of me getting a new chunk of data in; don't worry about it
model = update(model, newdata) #this is where the model is updated with the new data; biglm's update() takes the new chunk as its second argument (moredata), not as data =

The problem is that the is_paid factor variable is almost always 0. So sometimes when I read in a chunk of data, every value in the is_paid column will be 0, and I obviously get the following error:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
contrasts can be applied only to factors with 2 or more levels
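
The same error can be reproduced without biglm. The following is only a minimal sketch, assuming a made-up chunk in which is_paid is a factor that happens to contain nothing but 0; the data frame `chunk` and its values are illustrative, not taken from the real file. It calls `model.matrix()` directly, since that is the step where base R builds the design matrix and applies contrasts.

```r
# Hypothetical chunk in which is_paid happens to be all 0
chunk <- data.frame(
  output  = c(1100518, 2104518, 1100200, 1245110),
  year    = factor(c(12, 12, 15, 16)),
  temp    = c(40, 29, 17, 18),
  is_paid = factor(c(0, 0, 0, 0))   # only one level is observed: "0"
)

# The factor was built from this chunk alone, so it has a single level
nlevels(chunk$is_paid)

# Building the design matrix fails; capture the message instead of stopping
res <- tryCatch(
  model.matrix(output ~ year:is_paid + temp, data = chunk),
  error = function(e) conditionMessage(e)
)
print(res)
```

The failure comes from the factor having one level in this chunk, not from the model itself.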

So basically, I need a way for the model to accept an update without getting angry when a new chunk of data does not contain two different levels of the factor.

One way I was thinking of doing this was to always have one line of real data with a '1' value for is_paid, and add it on to the new chunk. This way, there is more than one kind of factor, and I'm still adding real data. The code would look something like this:

#the variable 'line' is a single line of data that has a '1' for is_paid
newdata = file[i] #again, an example of me reading in a new chunk of data. I know that this doesn't make sense by itself
newdata = rbind(line,newdata) #add in the sample line with '1' in is_paid to newdata
model = update(model,newdata) #update the data

Here is an example of my data:

output  year    temp is_paid
1100518     12     40   0
2104518     12     29   0   
1100200     15     17   0   
1245110     16     18   0 
5103128     14     30   0 

And here is an example of my sample line, which is a real record where is_paid was 1:

output  year temp is_paid
31200599 12  49     1

Would adding in the same line over and over distort the coefficients I get for my variables? I tested it on some dummy code, and it didn't look like updating a model with the same record over and over affects it, but I'm suspicious.
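
One way to check this concern is a small sketch with plain `lm()` and made-up data (the names `d`, `extra`, `fit_once` and `fit_many` are illustrative only). In least squares, every repeated row counts as one more observation, so a record that does not sit exactly on the fitted line should pull the estimates toward itself as more copies are added. The sketch compares a fit with one copy of the extra record against a fit with 50 copies.

```r
set.seed(1)

# Made-up data following a simple linear trend
d <- data.frame(x = 1:10, y = 1:10 + rnorm(10))

# A single made-up record that lies far from that trend
extra <- data.frame(x = 5, y = 20)

# Fit with the extra record included once
fit_once <- lm(y ~ x, data = rbind(d, extra))

# Fit with the same record repeated 50 times
fit_many <- lm(y ~ x, data = rbind(d, extra[rep(1, 50), ]))

# Compare the two sets of coefficients
coef(fit_once)
coef(fit_many)
```

If the two sets of coefficients differ here, the same should be expected when the identical record is appended to every chunk, because incremental updating is meant to reproduce the fit on all rows stacked together.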

I feel like there is a far more elegant and intelligent way to do this. I've been reading R tutorials, and it seems like there is a way to set the contrasts for an lm model. I looked at the 'contrasts' argument in 'lm', but couldn't figure anything out. I don't think you can set contrasts in biglm anyway, which is what I need to use. I would really appreciate any insights or solutions you guys can think of.

*Comparison of numeric vs. factor variable for is_paid:

df.num = data.frame(a = c(1:10),b = as.factor(rep(c(1,2,3,4,5),each = 2)),c = c(rep(0,each = 5),rep(1,each = 5)))
df.factor = data.frame(a = c(1:10),b = as.factor(rep(c(1,2,3,4,5),each = 2)),c = as.factor(c(rep(0,each = 5),rep(1,each = 5))))

mod.factor = lm(a~b:c,data = df.factor)
mod.num = lm(a~b:c,data = df.num)

> mod.factor

Call:
lm(formula = a ~ b:c, data = df.factor)
Coefficients:
(Intercept)        b1:c0        b2:c0        b3:c0        b4:c0        b5:c0        b1:c1  
    9.5         -8.0         -6.0         -4.5           NA           NA           NA  
  b2:c1        b3:c1        b4:c1        b5:c1  
     NA         -3.5         -2.0           NA  


> mod.num

Call:
lm(formula = a ~ b:c, data = df.num)

Coefficients:
(Intercept)         b1:c         b2:c         b3:c         b4:c         b5:c  
    3.0           NA           NA          3.0          4.5          6.5  

The conclusion here is that the model is changed if is_paid is numeric.

****I also slightly edited my model to instead look at the interactions of two factors rather than just three variables. This means I cannot treat is_paid as a numeric (I think)

Upvotes: 0

Views: 1604

Answers (1)

Gregor Thomas

Reputation: 145805

Turning Ben Bolker's comment into an answer, with some better-simulated data as evidence that it works.

Just treat your two-level factor as continuous (numeric 0/1). For a 0/1 variable this gives the same fit as treating it as a factor.

Example:

df.num = data.frame(a = rnorm(12),
                    b = as.factor(rep(1:4,each = 3)),
                    c = rep(0:1, 6))
df.factor = df.num
df.factor$c = factor(df.factor$c)

mod.factor = lm(a~b*c - 1,data = df.factor)
mod.num = lm(a~b*c - 1,data = df.num)

all(coef(mod.factor) == coef(mod.num))
# [1] TRUE
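
To connect this back to the chunked setting, here is a minimal sketch, assuming a made-up chunk in which every is_paid value is 0. It only builds design matrices with `model.matrix()` and does not call biglm. The names `chunk`, `chunk_f` and `all_years`, and the assumption that the full set of year values is known in advance, are illustrative and not from the question. Declaring all levels explicitly is shown as a second option for anyone who wants to keep is_paid as a factor.

```r
# Hypothetical chunk where every is_paid value is 0
chunk <- data.frame(
  output  = c(1100518, 2104518, 1100200, 1245110, 5103128),
  year    = c(12, 12, 15, 16, 14),
  temp    = c(40, 29, 17, 18, 30),
  is_paid = c(0, 0, 0, 0, 0)
)

# Assumption: all possible year values are known up front (made-up range)
all_years <- 1:20
chunk$year <- factor(chunk$year, levels = all_years)

# Option 1: keep is_paid numeric (0/1), as suggested above
mm_num <- model.matrix(output ~ year * is_paid + temp - 1, data = chunk)

# Option 2: keep it a factor, but declare both levels explicitly
chunk_f <- chunk
chunk_f$is_paid <- factor(chunk_f$is_paid, levels = c(0, 1))
mm_fac <- model.matrix(output ~ year * is_paid + temp - 1, data = chunk_f)

# Both calls succeed and give design matrices of the same shape
dim(mm_num)
dim(mm_fac)
```

Fixing the levels of year in the same way should also keep the design matrix columns consistent from one chunk to the next, since a chunk rarely contains every year. Whether the numeric and factor codings give the same parametrization depends on the formula, so it is worth checking with the formula actually used.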

Upvotes: 2
