Reputation: 257
I'm fitting a linear model to a very large data set (50 million rows) with the biglm package. I first fit the model on one chunk of data, then update it by reading in further chunks (1 million rows each) and calling the 'update' function from 'biglm'. My model uses year (a factor with 20 levels), temperature, and a factor variable called is_paid that is 1 or 0. The code looks something like this:
model = biglm(output~year:is_paid+temp,data = df) #creates my original model from a starting data frame, df
newdata = file[i] #This is just an example of me getting a new chunk of data in; don't worry about it
model = update(model,moredata = newdata) #this is where the update to the new model with the new data happens (the biglm method names this argument 'moredata')
The problem is that the is_paid factor variable is almost always 0. So sometimes when I read in a chunk of data, every value in the is_paid column will be 0, and I obviously get the following error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
So basically, I need a way for the model to accept an update without complaining when a new chunk of data does not contain both levels of the factor.
One way I was thinking of doing this was to keep one line of real data with a '1' value for is_paid and add it to each new chunk. This way, the factor always has more than one level, and I'm still adding real data. The code would look something like this:
#the variable 'line' is a single line of data that has a '1' for is_paid
newdata = file[i] #again, an example of me reading in a new chunk of data. I know that this doesn't make sense by itself
newdata = rbind(line,newdata) #add in the sample line with '1' in is_paid to newdata
model = update(model,newdata) #update the data
Here is an example of my data:
output year temp is_paid
1100518 12 40 0
2104518 12 29 0
1100200 15 17 0
1245110 16 18 0
5103128 14 30 0
And here is an example of my sample line, which is a real record where is_paid was 1:
output year temp is_paid
31200599 12 49 1
Would adding the same line over and over distort the coefficients I get for my variables? I tested it on some dummy code, and it didn't look like updating a model with the same record over and over affects it, but I'm suspicious.
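A minimal sketch of what repeating a record does in ordinary least squares, using plain lm and made-up data rather than biglm (the names base, extra, and k are hypothetical, and the extra record is deliberately far from the rest of the data). Repeating a row k times is equivalent to fitting with weight k on that row, so the repeated record pulls the fit toward itself; how strongly depends on how unusual that record is.

```r
set.seed(42)

# Made-up data: 30 ordinary records following y = 2 + 3x + noise
base <- data.frame(x = rnorm(30))
base$y <- 2 + 3 * base$x + rnorm(30)

# Hypothetical record that gets appended to every chunk
extra <- data.frame(x = 4, y = -10)

# Fit with the extra record included once
fit_once <- lm(y ~ x, data = rbind(base, extra))

# Fit with the same record repeated k times
k <- 20
fit_rep <- lm(y ~ x, data = rbind(base, extra[rep(1, k), ]))

# The same fit expressed as a weighted regression: weight k on the extra record
fit_w <- lm(y ~ x, data = rbind(base, extra), weights = c(rep(1, 30), k))

print(coef(fit_once))
print(coef(fit_rep))
print(all.equal(coef(fit_rep), coef(fit_w)))
```

If the repeated record happens to lie close to the fitted surface, the shift can be small enough to go unnoticed in a quick test, which may be why dummy data looked unaffected.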
I feel like there is a far more elegant and intelligent way to do this. I've been reading R tutorials, and it seems like there is a way to set the contrasts for an lm model. I looked at the 'contrasts' argument in 'lm', but couldn't figure anything out. I don't think you can set contrasts in biglm anyway, which is what I need to use. I would really appreciate any insights or solutions you guys can think of.
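One commonly used alternative to touching the contrasts is to declare the factor levels explicitly on every chunk, so that the factor has two levels even when only one of them is observed. The sketch below uses base R's model.matrix to show the effect on a chunk where is_paid is all 0; it does not call biglm itself, and year_levels and paid_levels are hypothetical names for level sets assumed to be known up front. The biglm help page states that factor levels must be the same across all chunks and that empty levels are allowed, which is what this relies on.

```r
# Hypothetical chunk in which every is_paid value happens to be 0
chunk <- data.frame(
  output  = c(1100518, 2104518, 1100200, 1245110, 5103128),
  year    = c(12, 12, 15, 16, 14),
  temp    = c(40, 29, 17, 18, 30),
  is_paid = c(0, 0, 0, 0, 0)
)

# Level sets declared once for the whole data set (assumed, not from the post)
year_levels <- 12:16
paid_levels <- c(0, 1)

# Naive conversion: levels are inferred from the chunk, so is_paid has one level
naive <- transform(chunk, year = factor(year), is_paid = factor(is_paid))
naive_result <- tryCatch(
  model.matrix(output ~ year:is_paid + temp, data = naive),
  error = function(e) e
)
print(inherits(naive_result, "error"))

# Explicit levels: is_paid keeps both levels even though only 0 is observed
fixed <- transform(
  chunk,
  year    = factor(year, levels = year_levels),
  is_paid = factor(is_paid, levels = paid_levels)
)
X <- model.matrix(output ~ year:is_paid + temp, data = fixed)
print(nlevels(fixed$is_paid))
print(colnames(X))
```

Applying the same two factor() calls to every chunk before passing it to biglm or update keeps the design-matrix columns identical from chunk to chunk.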
*Comparison of numeric vs. factor variable for is_paid:
df.num = data.frame(a = c(1:10),b = as.factor(rep(c(1,2,3,4,5),each = 2)),c = c(rep(0,each = 5),rep(1,each = 5)))
df.factor = data.frame(a = c(1:10),b = as.factor(rep(c(1,2,3,4,5),each = 2)),c = as.factor(c(rep(0,each = 5),rep(1,each = 5))))
mod.factor = lm(a~b:c,data = df.factor)
mod.num = lm(a~b:c,data = df.num)
> mod.factor
Call:
lm(formula = a ~ b:c, data = df.factor)
Coefficients:
(Intercept)        b1:c0        b2:c0        b3:c0        b4:c0        b5:c0        b1:c1
        9.5         -8.0         -6.0         -4.5           NA           NA           NA
      b2:c1        b3:c1        b4:c1        b5:c1
         NA         -3.5         -2.0           NA
> mod.num
Call:
lm(formula = a ~ b:c, data = df.num)
Coefficients:
(Intercept)         b1:c         b2:c         b3:c         b4:c         b5:c
        3.0           NA           NA          3.0          4.5          6.5
The conclusion here is that the model is changed if is_paid is numeric.
****I also slightly edited my model so that it looks at the interaction of two factors rather than three separate variables. This means I cannot treat is_paid as numeric (I think).
Upvotes: 0
Views: 1604
Reputation: 145805
Turning Ben Bolker's comment into an answer, with some better-simulated data as evidence that it works.
Just treat your two-level factor as continuous (numeric 0/1). For a full model such as the b*c example below, this gives the same coefficients as treating it as a factor.
Example:
df.num = data.frame(a = rnorm(12),
                    b = as.factor(rep(1:4,each = 3)),
                    c = rep(0:1, 6))
df.factor = df.num
df.factor$c = factor(df.factor$c)
mod.factor = lm(a~b*c - 1,data = df.factor)
mod.num = lm(a~b*c - 1,data = df.num)
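all.equal(unname(coef(mod.factor)), unname(coef(mod.num))) #coefficient names differ (c1 vs. c), so compare the values only, with numeric tolerance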
# [1] TRUE
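To connect this back to the chunking problem, here is a small sketch (base R only, not run through biglm) of a chunk where the numeric is_paid column is all 0. Because is_paid is not a factor, no contrasts are set for it and the design matrix is built without error; the is_paid columns are simply all zero for that chunk. The name year_levels is hypothetical, and the year factor is still assumed to need the same fixed levels in every chunk.

```r
# Levels of year assumed to be known for the whole data set
year_levels <- 12:16

# Hypothetical chunk: is_paid is numeric and happens to be 0 everywhere
chunk <- data.frame(
  output  = c(1100518, 2104518, 1100200, 1245110, 5103128),
  year    = factor(c(12, 12, 15, 16, 14), levels = year_levels),
  temp    = c(40, 29, 17, 18, 30),
  is_paid = 0
)

# Builds without the contrasts error, since is_paid is not a factor
X <- model.matrix(output ~ year * is_paid + temp, data = chunk)

print(colnames(X))
print(all(X[, "is_paid"] == 0))
```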
Upvotes: 2