Reputation: 2507
I am analysing a dataset with over 450k rows. About 100k rows in one of the columns I am looking at (pa1min_) have NA values, due to non-responses and other random factors. This column records workout times in minutes.
I don't think it makes sense to fill the NA values with the mean or median, given that they make up nearly a quarter of the data and the bias that could introduce. I would like to impute the missing observations with a linear regression. However, I receive an error message:
Error: vector memory exhausted (limit reached?)
In addition: There were 50 or more warnings (use warnings() to see the first 50)
This is my code:
library(mice)
# single imputation by deterministic regression
imp_model <- mice(brfss2013, method = "norm.predict", m = 1)
# store the completed data
data_imp <- complete(imp_model)
# multiple imputation (5 imputed datasets)
imp_model <- mice(brfss2013, m = 5)
# build the predictive model
fit <- with(data = imp_model, lm(y ~ x + z))
# combine the results
combined <- pool(fit)
Here is a link to the data (compressed) Data
Note: I really just want to impute one column. The other columns in the data frame are a mixture of characters, integers and factors, some with more than 2 levels.
Upvotes: 0
Views: 263
Reputation: 159
Similar to what MrFlick mentioned, you are somewhat short on RAM.
Try running the algorithm on 1% of your data and, if that succeeds, look into the bigmemory package for on-disk computation.
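As a rough sketch of the 1% trial run, the snippet below samples rows and asks mice to impute only pa1min_. It uses a small synthetic data frame rather than brfss2013, so the age and bmi columns are hypothetical stand-ins for whatever numeric predictors you pick. Treat it as an illustration under those assumptions, not a tuned setup.

```r
library(mice)

# Toy stand-in for brfss2013; `age` and `bmi` are hypothetical predictors.
set.seed(1)
n <- 5000
toy <- data.frame(age = rnorm(n, 50, 15), bmi = rnorm(n, 27, 4))
toy$pa1min_ <- 200 - 1.5 * toy$age + 2 * toy$bmi + rnorm(n, sd = 20)
toy$pa1min_[sample(n, n / 4)] <- NA   # about a quarter missing

# Draw a 1% random sample of rows for the trial run
idx <- sample(nrow(toy), size = ceiling(0.01 * nrow(toy)))
small <- toy[idx, ]

# Impute only pa1min_: an empty method string tells mice to skip a column
meth <- make.method(small)
meth[names(meth) != "pa1min_"] <- ""
meth["pa1min_"] <- "norm.predict"

# Only one column is incomplete here, so a single iteration is enough
imp_small <- mice(small, method = meth, m = 1, maxit = 1, printFlag = FALSE)
small_imp <- complete(imp_small)
```

If this runs comfortably, the same method vector can be reused on larger samples to see where memory starts to become a problem.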
I also encourage you to check whether the model you fit on your data is already good enough without Bayesian imputation, because chasing perfect data may not be much more beneficial than simply imputing mean/median/first/last values.
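One way to make that check concrete is to hide some values you do know, impute them both ways, and compare the error. The sketch below does this in base R on synthetic data. The age and bmi columns and the helper rmse() are made up for illustration, and the toy data is built with a linear signal, so the result here says nothing about how the two approaches would compare on the real survey.

```r
# Synthetic data with a known linear relationship (hypothetical columns)
set.seed(42)
n <- 2000
d <- data.frame(age = rnorm(n, 50, 15), bmi = rnorm(n, 27, 4))
d$pa1min_ <- 200 - 1.5 * d$age + 2 * d$bmi + rnorm(n, sd = 20)

# Hide 25% of the known values so each imputation can be scored
hidden <- sample(n, n / 4)
truth <- d$pa1min_[hidden]
d$pa1min_[hidden] <- NA

# Median imputation: every hidden value gets the observed median
med_pred <- rep(median(d$pa1min_, na.rm = TRUE), length(hidden))

# Regression imputation: lm() drops rows with NA in the response by default
fit <- lm(pa1min_ ~ age + bmi, data = d)
reg_pred <- predict(fit, newdata = d[hidden, ])

# Root mean squared error against the hidden true values
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
c(median = rmse(med_pred, truth), regression = rmse(reg_pred, truth))
```

If the two errors are close on your own data, the simpler fill may be good enough for the analysis you have in mind.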
Hope this helps.
Upvotes: 2