Bas

Reputation: 1076

biglm finds the wrong data.frame to take the data from

I am trying to split my dataset into chunks so that I can run biglm on it (with fastLm I would need 350 GB of RAM).

My complete dataset is called res. As an experiment I drastically reduced its size to 10,000 rows. I want to create chunks to use with biglm.

library(biglm)

formula <- iris$Sepal.Length ~ iris$Sepal.Width

test <- iris[1:10,]

biglm(formula, test)

And somehow, I get the following output:

> test <- iris[1:10,]
> test
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa

Above you can see that the data frame test contains 10 rows. Yet when I run biglm, it reports a sample size of 150:

> biglm(formula, test)
Large data regression model: biglm(formula, test)
Sample size =  150 

It looks like it uses iris instead of test. How is this possible, and how do I get biglm to use test (my chunk) the way I intend it to?

Upvotes: 0

Views: 73

Answers (1)

Paul Hiemstra

Reputation: 60964

I suspect the following line is to blame:

formula <- iris$Sepal.Length ~ iris$Sepal.Width

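Here the formula explicitly references the iris dataset. iris$Sepal.Length is an ordinary R expression, so when biglm builds the model frame, R looks up the object iris itself rather than a column of test. Because of R's scoping rules it finds the full 150-row iris dataset on the search path, so the data argument is effectively ignored.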
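To see this lookup in isolation, here is a minimal sketch using base R's model.frame, which is the kind of step a modelling function goes through to collect its variables. The names mf_bad and mf_good are only illustrative, and I am assuming biglm resolves formula variables in the same way as model.frame does.

```r
# iris ships with base R (package datasets), so no extra packages are needed
test <- iris[1:10, ]

# The formula names the full iris object, so the data argument has no effect
mf_bad <- model.frame(iris$Sepal.Length ~ iris$Sepal.Width, data = test)

# Bare column names are looked up in the data argument first
mf_good <- model.frame(Sepal.Length ~ Sepal.Width, data = test)

nrow(mf_bad)   # rows taken from iris
nrow(mf_good)  # rows taken from test
```

The first call picks up all rows of iris even though test was passed as data, which matches the sample size of 150 in the question.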

In a formula you normally do not use vectors, but simply the column names:

formula <- Sepal.Length ~ Sepal.Width

This ensures that the formula contains only the column (or variable) names, which are then looked up in the data that biglm is passed. So biglm will use test instead of iris.
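As a rough sketch of how this could look with chunks: the split of iris into three pieces of 50 rows below is purely illustrative (it stands in for chunks of res), and the names chunks and fit are my own. It relies on the update method that biglm provides for adding more data to an existing fit.

```r
library(biglm)

# Column names only, so each variable is found in whatever data is passed
formula <- Sepal.Length ~ Sepal.Width

# Illustrative chunking: three chunks of 50 rows each
chunks <- split(iris, rep(1:3, each = 50))

# Fit on the first chunk, then feed in the remaining chunks one at a time
fit <- biglm(formula, data = chunks[[1]])
for (chunk in chunks[-1]) {
  fit <- update(fit, chunk)
}

fit$n      # total number of rows seen across all chunks
coef(fit)  # should agree with lm(formula, data = iris) up to rounding
```

With the formula written this way, each call only sees the rows of the chunk it is given, so the sample size grows chunk by chunk instead of jumping straight to the size of iris.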

Upvotes: 2
