Emm
Emm

Reputation: 2507

Imputing missing observation

I am analysing a dataset with over 450k rows about 100k rows in one of the columns I am looking at (pa1min_) has NA values, due to non-responses and other random factors. This column deals with workout times in minutes.

I don't think it makes sense to fill the NA values with the mean or median given that it's nearly a quarter of the data and the biases that could potentially create. I would like to impute the missing observations with a linear regression. However, I receive an error message:

Error: vector memory exhausted (limit reached?)
In addition: There were 50 or more warnings (use warnings() to see the first 50)

This is my code:

# imputing using multiple imputation deterministic regression
imp_model <- mice(brfss2013, method="norm.predict", m=1)
# store data
data_imp <- complete(imp_model)
# multiple imputation
imp_model <- mice(brfss2013, m=5)
# building predictive mode
fit <- with(data=imp_model, lm(y ~ x + z))
# combining results
combined <- pool(fit)

Here is a link to the data (compressed) Data

Note: I really just want to fill impute for one column...the other columns in the dataframe are a mixture of characters, integers and factors, some with more than 2 levels.

Upvotes: 0

Views: 263

Answers (1)

Just Burfi
Just Burfi

Reputation: 159

Similar to what MrFlick mentioned, you are somewhat short in RAM.

Try running the algorithm on 1% of your data, and if you succeed, you should try checking out the bigmemory package for doing in-disk computations.

I also encourage you to check if the model you fit on your data is actually good without bayesian imputation, because the fact of trying to have perfect data could not be much more beneficial than just imputating mean/median/first/last values on your data.

Hope this helps.

Upvotes: 2

Related Questions